July 20 2004
Table containing random log x values for four feature sets. The first column is log x for the combined features; the second column is the sum of log x for the four individual features.
Another table, this time with two random features chosen.
Ranking of various features in the listserv dataset using variance.
Greedy algorithm results.
More details for the above table.
My ppt slides
listserv lda results

July 16 2004
Here is a table of feature vectors used as input for BBR.
Ranking of lda performance of various sets of features.

July 15 2004
Results from linear discriminant analysis in R for four features chosen from the contingency table rankings.
A table of the values used for the lda above.
The following four plots are slices of the 4-D graph of the four features described above. Each graph contains a scatter plot of three features: plot 1, plot 2, plot 3, plot 4.
Results of linear discriminant analysis for different combinations of features:
The following classified all disputed papers as Madison:
  upon, 1-letter words, 2-letter words
  top 3 function words (upon, enough, there)
  upon, there (top 2 function words)
The following classified most disputed papers as Madison:
  upon, 2-letter words, 3-letter words
The following did not distinguish well:
  2-letter words, 3-letter words
  1-letter words, 2-letter words
  1-letter words, 2-letter words, 5-letter words (top 3 word lengths)
  there, 1-letter words, 2-letter words
  2-letter words, 5-letter words

July 14 2004
We rank the discriminating ability of a set of features by computing a 2x2 contingency table for each feature, and then taking the log of the determinant of the table. The features considered are 30 function words used by Mosteller and Wallace as well as the frequency of n-letter words for n ranging from 1 to 19.
Results when a constant of 1 is added to each contingency table are here.
Results when a constant of 0.5 is added to each contingency table are here.

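A minimal sketch of the scoring step described in this entry, in Python. The table layout is an assumption (rows are the two authors, columns are the count of the feature and the count of everything else), and the counts in the example are made up. Taking the absolute value of the determinant so that the log is always defined is also an assumption; the entry does not say how a negative determinant was handled.

```python
import math


def log_det_score(a, b, c, d, constant=0.5):
    """Score a 2x2 contingency table [[a, b], [c, d]].

    `constant` is added to every cell first (the entry mentions 1 and 0.5).
    Returns log(|ad - bc|), or -inf when the determinant is exactly zero.
    """
    a, b, c, d = (cell + constant for cell in (a, b, c, d))
    det = a * d - b * c
    if det == 0:
        return float("-inf")
    # Assumption: use |det| so that the log is defined for either sign.
    return math.log(abs(det))


def rank_features(tables, constant=0.5):
    """Sort feature names by score, most discriminating first.

    `tables` maps a feature name to its four cell counts (a, b, c, d).
    """
    scores = {name: log_det_score(*cells, constant=constant)
              for name, cells in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    example = {
        "feature_a": (30, 970, 2, 998),
        "feature_b": (20, 980, 19, 981),
    }
    print(rank_features(example, constant=1))
```

A larger score means the two rows of the table differ more in how often the feature occurs, which is why the score can be used for ranking.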
July 7 2004
Results from linear discriminant analysis in R

July 6 2004
Rates of the word 'upon' in the Federalist papers, per 1000 words
Here is a 3-D scatter plot containing the fraction of length-3 words vs. the fraction of length-2 words vs. the frequency of the word 'upon' (a Hamilton marker word)

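For reference, a rate per 1000 words can be computed along these lines in Python. The tokenization (lowercased runs of letters) is an assumption and may differ from the scripts actually used for this page.

```python
import re


def rate_per_thousand(text, word):
    """Occurrences of `word` per 1000 words of `text` (case-insensitive)."""
    # Assumption: a "word" is any maximal run of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)


if __name__ == "__main__":
    sample = "Upon reflection, the measure depends upon the consent of the people."
    print(rate_per_thousand(sample, "upon"))
```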
June 30 2004
Graph containing the fraction of length-3 words on one axis, and the fraction of length-2 words on the other axis, for the Madison, Hamilton, and disputed papers.
Our PowerPoint presentation!

June 29 2004
We counted the number of occurrences of words of length i in the Federalist papers.
Here is a graph of word length frequency data.
And the results of the Chi-squared test for word length data.
Next we computed the ratio of 3-letter words to 2-letter words for the Federalist papers. Here is a table of this word length ratio data. Ratios are computed for a set of Hamilton papers, a set of Madison papers, individual undisputed papers, and individual disputed papers.
Results of the Kolmogorov-Smirnov test applied to two sets of sample data. Let r = ratio of 3-letter words to 2-letter words. The first set of sample data corresponds to r for 14 Madison papers. The second set of sample data corresponds to r for 18 Hamilton papers.
Here is a graph of the above data. This shows a Madison cluster of the 3:2 word length ratios and a Hamilton 3:2 ratio cluster. The disputed paper ratios fall between these two clusters.
3-letter : 2-letter word ratios (note: these are not very useful files; we use them to graph the word length ratio data): Madison, Hamilton, Disputed

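The word length counts and the ratio r can be sketched as follows in Python. The tokenization rule (runs of letters, so punctuation and digits are dropped) is an assumption, not necessarily what the original scripts did.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Map each word length i to the number of words of that length."""
    # Assumption: a "word" is any maximal run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(w) for w in words)


def ratio_3_to_2(text):
    """r = (number of 3-letter words) / (number of 2-letter words)."""
    counts = word_length_counts(text)
    if counts[2] == 0:
        raise ValueError("text contains no 2-letter words")
    return counts[3] / counts[2]


if __name__ == "__main__":
    sample = "to be or not to be the end"
    print(sorted(word_length_counts(sample).items()))
    print(ratio_3_to_2(sample))
```

Computing r once per paper gives one sample value per paper, which is the form of input the Kolmogorov-Smirnov comparison in this entry works on.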
June 28 2004
Results of Chi-squared test, and more Chi-squared results, with the data placed in bins, each corresponding to a range of sentence lengths
Input data files for statistical tests: Hamilton, Madison

June 24 and 25 2004
Note: these are transcripts from the R statistics package. The resulting p-value is at the very bottom of each page.
Results of Kolmogorov-Smirnov test
Results of Kolmogorov-Smirnov test 2. Here the values are "made continuous", i.e. values are randomly perturbed so that no value is repeated.

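The idea of perturbing values to remove ties, followed by a two-sample Kolmogorov-Smirnov comparison, can be sketched in Python as below. This is an illustration only: the perturbation size and seed are assumptions, and the code computes just the KS statistic D (the largest gap between the two empirical distribution functions), not the p-value reported in the R transcripts.

```python
import random
from bisect import bisect_right


def jitter(values, scale=1e-3, seed=0):
    """Add a small uniform perturbation in [-scale, scale] to each value.

    Assumption: `scale` is far smaller than the gap between distinct
    values, so the ordering of distinct values is unchanged.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]


def ks_statistic(xs, ys):
    """Two-sample KS statistic: max |F_x(v) - F_y(v)| over all data points."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # empirical CDF of xs at v
        fy = bisect_right(ys, v) / len(ys)  # empirical CDF of ys at v
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    # Hypothetical sentence lengths, for illustration only.
    sample_a = [12, 30, 30, 45, 45, 45]
    sample_b = [20, 30, 38, 45, 60, 60]
    print(ks_statistic(jitter(sample_a, seed=1), jitter(sample_b, seed=2)))
```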
June 22 2004
madison_sentence_profile.txt
hamilton_sentence_profile.txt
all_hamilton_sentence_profile.txt
The next few files contain two columns each. The first column is the index i, corresponding to sentence length. The second column is the number of sentences of length <= i found in a set of Federalist papers.
madison_cum.txt: 14 Federalist papers attributed to Madison, 1139 total sentences.
hamilton_cum.txt: 18 Federalist papers attributed to Hamilton, 1142 total sentences.
all_hamilton_cum.txt: All papers attributed to Hamilton.

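A short Python sketch of how a two-column cumulative table of this kind can be produced from a list of sentence lengths. The function name and the sample lengths are hypothetical; the original files were presumably generated by the Perl scripts listed further down.

```python
from collections import Counter


def cumulative_counts(lengths):
    """Return rows (i, number of sentences with length <= i) for i = 1..max."""
    if not lengths:
        return []
    counts = Counter(lengths)
    rows = []
    running = 0
    for i in range(1, max(lengths) + 1):
        running += counts[i]  # Counter returns 0 for lengths that never occur
        rows.append((i, running))
    return rows


if __name__ == "__main__":
    # Hypothetical sentence lengths, for illustration only.
    for i, n in cumulative_counts([2, 3, 3, 5]):
        print(i, n)
```

The last row always equals the total number of sentences, which gives a quick consistency check against the totals quoted for each file.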
June 18 2004
sentence_dist.txt
Plot of the above table

June 17 2004
sent_length.pl
Madison_output.txt
word_stats.pl
word_count.txt