Project Description
I am working on the following problem:
Given a piece of text, a list of possible authors, and samples of text
from each of the authors, determine who wrote the text.
Authorship is to be determined only by examining the content and
structure of the text rather than something like the name written at the end
or an IP address (in the case of email).
The general method of identifying the author of a text is:
1) build a style model for each of n possible authors based on texts written by
these n authors and
2) determine the author of the unknown text by applying some
distance measure or by using a toolkit such as LEMUR (a toolkit for
language modeling and information retrieval).
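As a rough illustration of the two steps above, here is a minimal Python sketch. It is not our actual code: the tiny function-word list, the function names (style_model, euclidean, attribute), and the choice of Euclidean distance are all assumptions made for the example, and it does not use LEMUR.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# this sketch); a real study would use a much longer list.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "upon", "while", "by"]


def style_model(text):
    """Step 1: model a text as relative frequencies of function words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def euclidean(u, v):
    """One possible distance measure between two style models."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def attribute(unknown_text, samples_by_author):
    """Step 2: return the author whose model is closest to the unknown text.

    samples_by_author maps an author name to a list of sample texts.
    """
    models = {
        author: style_model(" ".join(samples))
        for author, samples in samples_by_author.items()
    }
    target = style_model(unknown_text)
    return min(models, key=lambda author: euclidean(models[author], target))


if __name__ == "__main__":
    samples = {
        "A": ["upon the hill upon the road upon the sea"],
        "B": ["while the rain fell while the wind blew while we slept"],
    }
    print(attribute("upon the table upon the chair", samples))
```

With samples this small the result means very little; the point is only the shape of the computation: one model per author, one model for the unknown text, and a nearest-model decision.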
The style model for an author is based on the stylometric features
found in the author's text. The following features can all be used to
model an author's style: word length distributions, sentence length
distributions, function word frequencies, email structural features
such as greetings and HTML tags, fraction of whitespace, punctuation, and
fraction of capitalized text.
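To make a few of these features concrete, here is a small Python sketch that computes some of them for a single text. The function name stylometric_features and the exact definitions (for example, splitting sentences on ".", "!" and "?", and reporting a mean sentence length instead of a full distribution) are simplifying assumptions for illustration, not the definitions we used.

```python
import re
import string
from collections import Counter


def stylometric_features(text):
    """Compute a handful of simple stylometric features for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split (an assumption): break on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text if c.isalpha()]

    n_chars = len(text) or 1
    n_words = len(words) or 1
    length_counts = Counter(len(w) for w in words)

    return {
        # Fraction of words having each length
        "word_length_dist": {k: v / n_words for k, v in sorted(length_counts.items())},
        # Mean words per sentence, a one-number summary of the distribution
        "mean_sentence_length": len(words) / (len(sentences) or 1),
        "whitespace_fraction": sum(c.isspace() for c in text) / n_chars,
        "punctuation_fraction": sum(c in string.punctuation for c in text) / n_chars,
        # Fraction of letters that are upper case
        "capital_fraction": sum(c.isupper() for c in letters) / (len(letters) or 1),
    }


if __name__ == "__main__":
    for name, value in stylometric_features("Hi there. OK!").items():
        print(name, value)
```

Each feature is normalized by the length of the text, so that a long sample and a short sample from the same author can be compared.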
It is worth noting that
context-dependent features (such as occurrences of the word "halibut")
do not make good stylometric features because they say more about the topic of a
specific text than about an author's ingrained and subconscious writing style.
This project has many applications, from determining who wrote the 12
disputed Federalist papers and deciding whether or not Shakespeare wrote all
of his plays, to determining which criminal wrote an email detailing
plans to bomb a building.
I am working on this project with fellow
REU-er Ross Sowell. More on these applications can be found at
Ross's page.
Work
Our final presentation, which was delivered at DIMACS on July 21,
2004, can be found here.
Some tables, graphs, and scripts that we have been working on can be
found here (Note: these files are
probably only useful to us, as they are generally not documented).
Here is the presentation delivered to the DIMACS REU participants on
July 2.
References
F. Mosteller and D. L. Wallace. Applied Bayesian and Classical
Inference. The Case of The Federalist Papers. Springer-Verlag New
York Inc., 1984.
M. Corney. Analysing E-mail Text Authorship for Forensic
Purposes. 181 pages, 2003.