2005
Undergraduate Research Project
Student: Jordanna Chord
Senior, Computer Science & Entrepreneurial Leadership
Gonzaga University, Spokane, WA
Email: jchord@gonzaga.edu
Project Name: Author Identification Challenge
Faculty Advisor: Dr. Paul B. Kantor, Professor II
School of Communication, Information, and Library Studies
Rutgers University
Overall Summary
Alexander Genkin, David D. Lewis, and David Madigan have developed software that implements Bayesian logistic regression with two choices of priors: Gaussian and Laplace (also known as double exponential). A detailed technical report presenting theoretical background on the approach, the fitting algorithm, and experimental results can be found at http://stat.rutgers.edu/~madigan/PAPERS/shortFat-v13.ps .
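As a rough illustration of what the two priors mean in practice, the sketch below computes the quantity that a maximum a posteriori fit would minimize: the logistic loss plus a penalty contributed by the prior. This is not the BBR implementation; the function name, the `scale` parameter, and the toy data are assumptions made for this example. A Gaussian prior gives a penalty that is quadratic in the coefficients, while a Laplace prior gives a penalty proportional to their absolute values, which tends to favor sparse models.

```python
import math

def neg_log_posterior(beta, X, y, prior="gaussian", scale=1.0):
    """Negative log posterior (up to an additive constant) for logistic regression.

    Labels y are coded as +1 / -1. The meaning of `scale` is an assumption of
    this sketch: the prior variance for "gaussian", the rate for "laplace".
    """
    # Logistic loss: sum of log(1 + exp(-y * beta.x)) over the training examples.
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(b * v for b, v in zip(beta, xi))
        loss += math.log1p(math.exp(-margin))

    if prior == "gaussian":
        # Zero-mean Gaussian prior: quadratic (ridge-like) penalty.
        penalty = sum(b * b for b in beta) / (2.0 * scale)
    elif prior == "laplace":
        # Zero-mean Laplace prior: absolute-value (lasso-like) penalty.
        penalty = scale * sum(abs(b) for b in beta)
    else:
        raise ValueError("prior must be 'gaussian' or 'laplace'")
    return loss + penalty

# Toy data, invented for illustration only.
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [1, -1]
gaussian_value = neg_log_posterior([1.0, -1.0], X_toy, y_toy, prior="gaussian")
laplace_value = neg_log_posterior([1.0, -1.0], X_toy, y_toy, prior="laplace")
```

The two values differ only in the penalty term, since the logistic loss is the same for both priors.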
This tool lets us take a set of lexical data whose class we can identify and create a model training file from it. The resulting model can then be applied to lexical data whose class is unknown to determine whether that data belongs to the same class.
For example, past research has disambiguated the authorship of Federalist Papers that were written by either Alexander Hamilton or James Madison by using word frequencies in the training model. We are able to determine the probability that a test document belongs to one class, Hamilton, or the other, Madison.
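The word-frequency idea can be sketched in a few lines. The example below is a minimal illustration, not the method used in the earlier Federalist studies or in BBR: the marker words, the toy sentences, the hyperparameters, and all function names are invented for this sketch. It turns each text into relative frequencies of a few marker words, fits a logistic regression with a Gaussian prior by plain gradient descent, and returns the probability that a new text belongs to class 1.

```python
import math
from collections import Counter

MARKERS = ["upon", "whilst"]  # hypothetical marker words, chosen for illustration

def features(text):
    """Bias term plus the relative frequency of each marker word."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [1.0] + [counts[w] / total for w in MARKERS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, variance=10.0, lr=0.5, steps=2000):
    """Gradient descent on logistic loss plus a Gaussian-prior penalty."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [b / variance for b in beta]  # gradient of the prior penalty
        for xi, yi in zip(X, y):
            err = sigmoid(sum(b * v for b, v in zip(beta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

def predict_proba(beta, text):
    """Estimated probability that `text` belongs to class 1."""
    return sigmoid(sum(b * v for b, v in zip(beta, features(text))))

# Toy training set, invented for illustration: label 1 and label 0 stand in
# for the two candidate authors.
train = [
    ("the power rests upon the states upon consent", 1),
    ("the states act whilst the people wait", 0),
    ("upon this point we rely upon reason", 1),
    ("whilst the law stands the people obey whilst free", 0),
]
beta = fit([features(t) for t, _ in train], [label for _, label in train])
print(predict_proba(beta, "we rely upon reason and upon law"))
```

With real documents one would use many more words and far more text per author; the toy setup only shows how word frequencies become a class probability.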
KDD Challenge
As part of the ongoing effort to improve and expand the BBR project, this research will include participation in a KDD challenge. Given data from the BioSci database, we will run numerous tests to distinguish between authors who share the same name. By experimenting with combinations of the data provided for each document, we hope to distinguish between authors who cannot be distinguished by name. Studies will include the use of various factors, including co-authors, keywords, the vocabulary of the abstracts, the address of the lead author, and the scientific journal classification.
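One simple way to experiment with different combinations of these factors is to prefix each token with the name of the field it came from, so that a word in an address and the same word in an abstract become separate features. The sketch below assumes each document is a dictionary of text fields; the field names, the record, and the function name are hypothetical and do not reflect the actual challenge data format.

```python
from collections import Counter

def document_features(record, fields):
    """Count field-prefixed tokens for the selected fields of one document."""
    feats = Counter()
    for field in fields:
        for token in record.get(field, "").lower().split():
            feats[f"{field}:{token}"] += 1
    return feats

# Hypothetical record, invented for illustration.
record = {
    "keywords": "protein folding",
    "address": "Spokane WA",
    "coauthors": "smith jones",
}

keywords_only = document_features(record, ["keywords"])
keywords_and_address = document_features(record, ["keywords", "address"])
```

Changing the `fields` list is then enough to rerun an experiment with a different subset of attributes.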
Abstract
Entity Resolution for Authors of Biological Sciences Papers
Jordanna Chord, Gonzaga University, Spokane, WA
Melissa Mitchell, University of Detroit - Mercy, Detroit, MI
Mentor: Dr. Paul Kantor, SCILS
When several persons share the same name, we would like to tell them apart by characteristics of their writings. The intelligence community, which supports our work, has defined a "challenge problem" which approximates the real problem that they face. They will present many items and ask us to separate them into those authored by "different persons".
To prepare for this challenge problem, we have selected ten "author names" for which there is one prolific contributor and an approximately equal number of papers by other (different) persons. We drew the associated abstracts, address information, and keywords from the online database PubMed.
We applied Bayesian Binary Regression to identify authors using a combination of seven document attributes. We have achieved strong performance (defined as a Receiver Operating Characteristic with an area under the curve of 80% or better) in several ways. This can be done using only keywords, or only addresses, or with a representation that includes all attributes: abstract words, address words, address fields, co-authors, keywords, and title.
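For readers unfamiliar with the performance measure, the area under the ROC curve equals the probability that a randomly chosen positive document receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions for illustration and are not results from this project.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` uses 1 for the target author and 0 for everyone else.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy scores, invented for illustration.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
meets_threshold = auc >= 0.80  # the "strong performance" criterion stated above
```

The pairwise loop is quadratic in the number of documents, which is acceptable for small test sets; a rank-based formula would be preferable for large ones.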
Results clearly identified keywords and addresses as each working well alone in identifying documents. One might expect the use of all variables to perform better, but there is little difference. We interpret this as meaning that either set of variables can accomplish the classification, but that they do not complement each other. Future research will examine how the regression approach balances selection among them when all variables are included.
Thanks to: Alex Genkin, Dmitriy Fradkin and Andrei Anghelescu for crucial support with coding and the use of the BBR software.
Other Applications
The ability to classify documents is an important tool in various industries.
Identifying Spam
Identifying the identity of authors of emails or other communications for national security purposes
Identifying fraud, in which a writer attempts to imitate another writer
Work in progress. Daily Journal. Also contains final results.
BBR: Bayesian Logistic Regression Software