
Re: VMs: Gordon Rugg's study follow ups

To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: Gordon Rugg's study follow ups
From: Bruce Grant <bgrant@xxxxxxxxxxxxx>
Date: Sat, 04 Dec 2004 18:44:46 -0500
In-reply-to: <41B155D0.8050502@asus.net>
References: <200412021016.iB2AGJe9019042@mail3.alphalink.com.au> <41B155D0.8050502@asus.net>
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)

Dennis wrote:

I wonder what chi2 you would see between two texts in the
same language by the same author but about different
subjects. As I recall, the stylometric methods used in
the New Testament and Book of Mormon studies I mentioned
compared the frequencies of common words like "the", "and",
"where", "before", and the like. If so, these
methods might well show the same chi2 for the two texts.
You should certainly do that as a control if you were to
make use of those methods on the VMs.

I don't know a lot about the chi-square test, but whenever I have seen it used in books, something was being classified into a small number of categories. The chi-square statistic was then calculated from the difference between the observed distribution by category and the distribution expected under some hypothesis. If the calculated statistic (which grows with the degree of difference) was too large, the hypothesis was rejected.
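
For what it's worth, the textbook calculation I have in mind can be sketched in a few lines of Python. This is only an illustration under assumptions: the die-roll counts are made up, `chi_square` is a hypothetical helper name, and 11.07 is the usual tabulated 5% critical value for 5 degrees of freedom.

```python
def chi_square(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up example: a die rolled 60 times, hypothesis "the die is fair".
observed = [8, 9, 12, 11, 6, 14]
expected = [10] * 6  # 60 rolls spread evenly over 6 faces

stat = chi_square(observed, expected)

# 6 categories give 5 degrees of freedom; 11.07 is the 5% critical value.
CRITICAL_5PCT_DF5 = 11.07
rejected = stat > CRITICAL_5PCT_DF5

print(f"chi-square = {stat:.2f}, reject fair-die hypothesis: {rejected}")
```

The larger the gap between observed and expected counts, the larger the statistic, which matches the "too large means reject" rule described above.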

For a pair of texts, though, it is not clear to me how to "categorize" the texts, that is, which of the many possible categorization schemes to use. For example, you could categorize the texts by the distribution of letter frequencies, the distribution of word lengths, the distribution of positions within the line of words containing "gallows letters", and so on. Each choice would yield a different chi-square statistic.
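
To make the point concrete, here is a rough sketch that compares the same two texts under two different categorizations. Everything in it is an assumption for illustration: the two sentences are invented, the helper names are hypothetical, and the two-sample form of the statistic (expected counts taken from the pooled totals) is just one common way to compare two texts. The texts are also far too short for a meaningful test.

```python
from collections import Counter


def chi_square_two_samples(counts_a, counts_b):
    """Chi-square statistic for a 2 x k table of category counts.

    Expected counts come from the pooled totals, i.e. the hypothesis
    that both texts share one underlying distribution.
    Returns (statistic, degrees_of_freedom).
    """
    categories = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for cat in categories:
        a = counts_a.get(cat, 0)
        b = counts_b.get(cat, 0)
        exp_a = (a + b) * total_a / grand
        exp_b = (a + b) * total_b / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat, len(categories) - 1


def letter_counts(text):
    """Categorize a text by its letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def word_length_counts(text):
    """Categorize a text by the lengths of its words."""
    return Counter(len(word) for word in text.split())


# Invented sample texts, standing in for two passages to compare.
text_a = "the quick brown fox jumps over the lazy dog"
text_b = "a slow green turtle crawls under a mossy log"

letter_stat, letter_df = chi_square_two_samples(
    letter_counts(text_a), letter_counts(text_b))
length_stat, length_df = chi_square_two_samples(
    word_length_counts(text_a), word_length_counts(text_b))

print(f"letters:      chi-square = {letter_stat:.2f}, df = {letter_df}")
print(f"word lengths: chi-square = {length_stat:.2f}, df = {length_df}")
```

The same pair of texts gives two different statistics with two different degrees of freedom, so "the chi-square between two texts" is not defined until the categorization is fixed.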

The categorization is also sensitive to features of the languages involved. For example, the existence of lots of "fuzzy matches" between sentences in the VMS suggests the possibility that some characters which EVA considers distinct might actually be the same. If this were true, it would strongly affect the letter-frequency categorization and could result in a quite different chi-square statistic.
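
A toy sketch of that effect, under loud assumptions: the two strings below are invented EVA-like strings rather than real transcription, and treating "a" and "o" as the same glyph is purely hypothetical. The helper names are mine as well.

```python
from collections import Counter


def letter_chi_square(text_a, text_b):
    """Two-sample chi-square statistic on letter counts (pooled expected)."""
    counts_a = Counter(ch for ch in text_a if ch.isalpha())
    counts_b = Counter(ch for ch in text_b if ch.isalpha())
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for ch in set(counts_a) | set(counts_b):
        a, b = counts_a.get(ch, 0), counts_b.get(ch, 0)
        exp_a = (a + b) * total_a / grand
        exp_b = (a + b) * total_b / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


# Invented "fuzzy match": the passages differ only in a versus o.
passage_1 = "daiin okaiin chol daiin qokal"
passage_2 = "doiin okoiin chol doiin qokol"

# Hypothetical merge: treat the character "a" as a variant of "o".
merge = str.maketrans({"a": "o"})

before = letter_chi_square(passage_1, passage_2)
after = letter_chi_square(passage_1.translate(merge),
                          passage_2.translate(merge))

print(f"distinct a/o: chi-square = {before:.2f}")
print(f"merged a/o:   chi-square = {after:.2f}")
```

In this contrived case the whole difference between the passages disappears once the two characters are merged, so the statistic depends directly on a transcription decision.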

Bruce
