Re: VMs: Gordon Rugg's study follow ups

Bruce Grant wrote:

This raises an interesting question: is there a good way to measure the similarity of texts objectively?

(I realize that calculation of entropy is one such test, but a broad one to be sure.)

Way too broad, I think.

I have seen a couple of books which applied "stylometric" techniques to New Testament Greek texts, for example, by comparing the relative frequencies of synonyms, but these techniques don't appear too useful for a text whose meaning is unknown.

Yes, about 40 years ago I saw an article on that, in our favorite publication in fact! The article said that not all of Paul's letters in the New Testament are in fact by Paul, something now accepted by the majority of scholars. I also saw a book, *Trouble Enough*, that did the same thing for the Book of Mormon. I think the algorithm counted common words like "the", "and", various prepositions, etc. I believe this is also what the New Testament studies did. The *Trouble Enough* study showed that the books of the Book of Mormon were not written by several different authors. So these methods compare texts within a corpus and could help establish the difference between A and B, but I don't know what else they could do.
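
For what it's worth, here is a minimal sketch of the kind of common-word counting I have in mind. The word list, the function names and the distance measure are all my own assumptions for illustration, not what any of those studies actually used.

```python
import re
from collections import Counter

# Hypothetical short list of common words; a real study would use a longer one.
FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "with", "by"]

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each listed word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}

def profile_distance(a, b):
    """Sum of absolute differences between two profiles (0 means identical)."""
    return sum(abs(a[w] - b[w]) for w in a)
```

Two texts with a small distance use the common words at similar rates. For a text whose meaning is unknown one would have to guess at a word list, for instance the most frequent tokens, which is exactly the difficulty Bruce points out.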

Most recently, the SHAXICON style checker showed that Newsweek staffer Joe Klein wrote the Clinton-era *roman à clef* "Primary Colors". I've also seen a style checker used to identify a notorious troll on USENET. I don't know how these programs work. Jim Gillogly mentioned SHAXICON on the list a long time ago, so perhaps he does.

Gabriel compared the Zipf's Law curves of known Latin texts by different authors, and on that basis wondered whether A and B are as different as we think. So that's something else to consider.
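
I don't know exactly how Gabriel made his comparison, so the following is only a sketch of one way to reduce a Zipf curve to a number: rank the words by frequency and fit a straight line to log(frequency) against log(rank). The function names and the least-squares fit are my assumptions, not his method.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A text that follows Zipf's Law closely gives a slope near -1.
    """
    freqs = rank_frequency(tokens)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

One could then compute the slope for A and for B separately and see whether they differ by more than the slopes of known texts by different authors do.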

There is the chi-squared test, of course. Jacques has mentioned the phi-squared test from time to time and said that phi-squared tests not just the significance of an observed difference between two sample data sets but also the magnitude of the difference. That sounds like the best thing of all.
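
To make the distinction concrete, here is a small sketch, assuming the usual definition of phi-squared as chi-squared divided by the total number of observations (I have not checked that this is the exact form Jacques uses). The function names are mine, and the two rows are counts of the same set of words or glyphs in two samples.

```python
def chi_squared(row_a, row_b):
    """Pearson chi-squared statistic for a 2 x k table of counts."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    total_a, total_b = sum(row_a), sum(row_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sample needs at least one observation")
    n = total_a + total_b
    chi2 = 0.0
    for a, b in zip(row_a, row_b):
        col = a + b
        if col == 0:
            continue  # a category absent from both samples contributes nothing
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def phi_squared(row_a, row_b):
    """Chi-squared divided by the total count (assumed definition)."""
    n = sum(row_a) + sum(row_b)
    return chi_squared(row_a, row_b) / n
```

With counts [10, 20] against [20, 10], chi-squared comes out at about 6.67 and phi-squared at about 0.11. Multiplying every count by ten raises chi-squared tenfold but leaves phi-squared unchanged, which is why phi-squared reads as a measure of how big the difference is rather than how much data we have.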


