
I'm working with Drs. S. Muthu Muthukrishnan
and Dr. Graham Cormode on algorithms for analyzing massive data streams. A
massive data stream is a sequence of information that is so long it cannot be
stored completely in memory, and as a result algorithms on data streams should
not just be time efficient, but space efficient as well. According to Estimating
entropy and entropy norm on data streams by Chakrabarti, Ba, and
Muthukrishnan, "Sublinear space is mandatory and polylogarithmic space is
often the goal."
Specifically, my project involves Streaming
Entropy computation. The entropy of a data stream can be informally viewed as
the amount of information contained in single character of the stream: the more
"random" the data stream, the higher its entropy. One application of
fast entropy computation on massive data streams is in detecting worms and
other anomalies in IP networks. According to Entropy Based Worm
and Anomaly Detection in Fast IP Networks by Wagner and Plattner, "The
connection between entropy and worm propagation is that worm traffic is more
uniform or structured than normal traffic in some respects and more random in
others."
I'm currently working on speeding up an
algorithm for entropy computation presented in A Near-optimal
algorithm for computing the entropy of a stream by Chakrabarti, Cormode,
and McGregor. I’ve implemented three versions of the algorithm: a “slow”
version that adheres to the pseudocode in the aforementioned paper, a “fast”
version, and a “naïve” version that does not take backup samples and hence is
not guaranteed to work well on streams with low entropies. An easy-to-use implementation
in C of the all three versions of this algorithm may be found here.
Excel files containing results from
experiments run using all three implementations, as well as a writeup of the
trends observed in these experiments, can be found here. The files titled
“timevlengthfast.xls”, “timevlengthslow.xls”, and “timevlengthnaive.xls” each
contain two graphs, both with length as the dependent variable and time or
percent error as the independent variable. The files “timevzipffast.xls”,
“timevzipfslow.xls”, and “timevzipfnaive.xls” also have two graphs, both with
the zipf parameter (skewness of the distribution) as the dependent variable.
The fast version of this algorithm performed extremely well in these
experiments, achieving excellent accuracy and processing 400,000-1,000,000
tokens/second.