July 20 2004

Table containing random log x values for four feature sets. First column is log x for combined features, second column is sum of log x for four individual features.

Another table, this time with two random features chosen.
Ranking of various features in the listserv dataset using variance.
Greedy algorithm results.
More details for above table.
My ppt slides

listserv lda results


July 16 2004

Here is a table of feature vectors used as input for BBR.

Ranking of lda performance of various sets of features.


July 15 2004

Results from linear discriminant analysis in R for four features chosen from the contingency table rankings.
A table of the values used for the lda above.

The following four plots are slices of the 4-D graph of the four features described above. Each graph contains a scatter plot of three features:
plot 1
plot 2
plot 3
plot 4

Results of linear discriminant analysis for different combinations of features:
The following classified all disputed papers as Madison:
upon, 1-letter words, 2-letter words
top 3 function words (upon, enough, there)
upon, there (top 2 function words)
The following classified most disputed papers as Madison:
upon, 2-letter words, 3-letter words
The following did not distinguish well:
2-letter words, 3-letter words
1-letter words, 2-letter words
1-letter words, 2-letter words, 5-letter words (top 3 word lengths)
there, 1-letter words, 2-letter words
2-letter words, 5-letter words




July 14 2004

We rank the discriminating ability of a set of features by computing a 2x2 contingency table for each feature, and then taking the log of the determinant of the table.
The features considered are 30 function words used by Mosteller and Wallace as well as the frequency of n-letter words for n ranging from 1 to 19.
Results when a constant of 1 is added to each contingency table are here.
Results when a constant of 0.5 is added to each contingency table are here.
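
A minimal sketch of the ranking described above. The layout of the 2x2 table is an assumption (rows are the two authors, columns count papers where the feature is above or below some cutoff), the absolute value of the determinant is taken so that the log is defined, and the counts in the example are hypothetical, not our data.

```python
import math

def log_det_score(table, constant=1.0):
    """Return log|det| of a 2x2 table after adding `constant` to every cell."""
    (a, b), (c, d) = table
    a, b, c, d = (x + constant for x in (a, b, c, d))
    det = a * d - b * c
    if det == 0:
        return float("-inf")  # rows are proportional: no discriminating signal
    # Assumption: use |det| so the score does not depend on row/column order.
    return math.log(abs(det))

def rank_features(tables, constant=1.0):
    """Return feature names ordered from most to least discriminating."""
    scored = [(log_det_score(t, constant), name) for name, t in tables.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical counts, for illustration only.
tables = {
    "upon": [[18, 0], [1, 13]],
    "2-letter words": [[10, 8], [7, 7]],
}
ranking = rank_features(tables, constant=0.5)
```

Adding the constant keeps a zero cell from forcing the determinant to depend on a single product, which is presumably why both 1 and 0.5 were tried.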


July 7 2004

Results from linear discriminant analysis in R


July 6 2004

Rates of the word 'upon' in the Federalist papers, per 1000 words
Here is a 3-D scatter plot containing the fraction of length-3 words vs. the fraction of length-2 words vs. the frequency of the word 'upon' (a Hamilton marker word)
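
For reference, a rate per 1000 words can be computed roughly as follows. The tokenization rule (lowercase runs of letters and apostrophes) and the function name are assumptions for illustration; this is not necessarily how our scripts split words.

```python
import re

def rate_per_1000(text, word):
    """Occurrences of `word` per 1000 word tokens in `text`."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)

sample = "Upon reflection, the decision rests upon the people."
rate = rate_per_1000(sample, "upon")  # 2 of 8 tokens
```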


June 30 2004

Graph containing the fraction of length-3 words on one axis, and the fraction of length-2 words on the other axis, for the Madison, Hamilton, and disputed papers.
Our powerpoint presentation!


June 29 2004

We counted the number of occurrences of words of length i in the Federalist papers.
Here is a graph of word length frequency data.
And the results of the Chi-squared test for the word length data.

Next we computed the ratio of 3-letter words to 2-letter words for the Federalist papers.
Here is a table of this word length ratio data. Ratios are computed for a set of Hamilton papers, a set of Madison papers, individual undisputed papers, and individual disputed papers.
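
A short sketch of the word length ratio. Treating a word as a maximal run of ASCII letters is an assumption (so hyphenated words and contractions are split), and the function names are made up for this example.

```python
import re
from collections import Counter

def word_length_counts(text):
    """Map word length -> number of words of that length."""
    # Assumed definition of a word: a maximal run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(w) for w in words)

def ratio_3_to_2(text):
    """Ratio of 3-letter words to 2-letter words, or None if undefined."""
    counts = word_length_counts(text)
    if counts[2] == 0:
        return None
    return counts[3] / counts[2]

sample = "It is the duty of the people to be on guard"
counts = word_length_counts(sample)
r = ratio_3_to_2(sample)  # two 3-letter words over six 2-letter words
```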

Results of the Kolmogorov-Smirnov test applied to two sets of sample data:
Let r = ratio of 3-letter words to 2-letter words.
The first set of sample data corresponds to r for 14 Madison papers.
The second set of sample data corresponds to r for 18 Hamilton papers.
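
The statistic behind the two-sample Kolmogorov-Smirnov test is the largest gap between the two empirical distribution functions. The sketch below computes only that statistic D (we got the p-values from R); the sample values are hypothetical and are not the ratios from our papers.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: max |F_x(v) - F_y(v)| over all values v."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step past every copy of v in both samples so ties are handled together.
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Hypothetical values of r for two small groups of papers.
sample_a = [1.10, 1.15, 1.20]
sample_b = [0.85, 0.90, 0.95]
d_separated = ks_statistic(sample_a, sample_b)
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

D equals 1 when the two samples do not overlap at all and 0 when their empirical distributions coincide.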

Here is a graph of the above data. This shows a Madison cluster of the 3:2 word length ratios, and a Hamilton 3:2 ratio cluster. The disputed paper ratios fall between these two clusters.

3-letter : 2-letter word ratios: (Note: these files are not very useful on their own; we use them to graph the word length ratio data.)
Madison
Hamilton
Disputed


June 28 2004

Results of the Chi-squared test, and more Chi-squared results with the data placed in bins, each bin corresponding to a range of sentence lengths.
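
A rough sketch of binning sentence lengths and computing a Chi-squared statistic for the two authors. The bin edges and counts are hypothetical, the test form (homogeneity on a 2 x k table) is an assumption about what was run, and only the statistic and degrees of freedom are computed here, not the p-value.

```python
from bisect import bisect_right

def bin_counts(lengths, edges):
    """Count values per bin [edges[k], edges[k+1]); the last bin is open-ended.
    Values below edges[0] are placed in the first bin."""
    counts = [0] * len(edges)
    for x in lengths:
        k = max(0, bisect_right(edges, x) - 1)
        counts[k] += 1
    return counts

def chi_squared(obs_a, obs_b):
    """Chi-squared statistic and degrees of freedom for a 2 x k table."""
    total_a, total_b = sum(obs_a), sum(obs_b)
    grand = total_a + total_b
    stat, used = 0.0, 0
    for a, b in zip(obs_a, obs_b):
        col = a + b
        if col == 0:
            continue  # an empty bin contributes nothing
        used += 1
        for obs, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / grand
            stat += (obs - expected) ** 2 / expected
    return stat, used - 1

# Hypothetical bin edges and counts, for illustration only.
binned = bin_counts([5, 12, 25, 40, 100], [1, 10, 20, 30])
stat, dof = chi_squared([10, 20], [20, 10])
```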

Input data files for statistical tests:
Hamilton
Madison


June 24 and 25 2004

Note: these are transcripts from the R statistics package. The resulting p-value is at the very bottom of each page.
results of Kolmogorov-Smirnov test
results of Kolmogorov-Smirnov test 2. Here the values are "made continuous", i.e. values are randomly perturbed so that no value is repeated.
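
One way to make the values continuous is to add a small uniform random offset to each one. This sketch assumes integer-valued data and a perturbation of at most 0.5, so distinct integers keep their order; the width, the seed, and the function name are assumptions, not the settings we used.

```python
import random

def jitter(values, width=0.5, seed=0):
    """Add uniform noise in [-width, width] to each value to break ties."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [v + rng.uniform(-width, width) for v in values]

lengths = [12, 12, 30, 30, 30, 45]  # hypothetical repeated values
perturbed = jitter(lengths)
```

The Kolmogorov-Smirnov test assumes continuous distributions, so repeated values otherwise make the reported p-value only approximate.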


June 22 2004

madison_sentence_profile.txt
hamilton_sentence_profile.txt
all_hamilton_sentence_profile.txt

The next few files contain two columns each. The first column is the index i, corresponding to sentence length. The second column is the number of sentences of length <= i found in a set of Federalist papers.

madison_cum.txt 14 Federalist papers attributed to Madison, 1139 total sentences.
hamilton_cum.txt 18 Federalist papers attributed to Hamilton, 1142 total sentences.
all_hamilton_cum.txt All papers attributed to Hamilton.
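
The two-column format of the cumulative files can be produced along these lines. The function name and the input list are hypothetical; the actual files were generated by our own scripts.

```python
from collections import Counter

def cumulative_counts(sentence_lengths):
    """Return rows (i, number of sentences with length <= i) for i = 1..max."""
    if not sentence_lengths:
        return []
    freq = Counter(sentence_lengths)
    rows, running = [], 0
    for i in range(1, max(sentence_lengths) + 1):
        running += freq[i]  # Counter returns 0 for lengths that never occur
        rows.append((i, running))
    return rows

rows = cumulative_counts([3, 1, 3, 5])  # hypothetical sentence lengths
```

The second column of the last row always equals the total number of sentences in the set.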


June 18 2004

sentence_dist.txt
plot of the above table


June 17 2004

sent_length.pl
Madison_output.txt
word_stats.pl
word_count.txt