
VMs: Stolfi's Binomial word lengths revisited

Hi all,

I just had another look at Stolfi's word length analysis,
namely the claim that word lengths are distributed as binom(9,0.5)
shifted by 1. The visual agreement between the observed word-length
frequencies and the expected frequencies from the binomial distribution
is so good that I was convinced the match would be highly significant
statistically. To my surprise, it isn't:

Testing the null hypothesis 
H0: word lengths are drawn from binom(9,0.5) shifted by 1
H1: not so

I got:

G-statistic = 17.26
df = 9
p-value = 0.045 ( = 1 - chi2cdf(17.26,9) )


chi-square = 50.37
df = 9
p-value < 10^-7  ( = 1 - chi2cdf(50.37,9) )

so the null hypothesis is flatly rejected by the chi-square test
and rejected at the 5% significance level by the G-test.
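
For anyone who wants to redo the test, here is a minimal sketch of both
statistics in Python. It is not the code used for the numbers above. The
function names are made up for illustration, the observed counts must be
supplied by the reader, and lengths outside 1..10 are simply ignored
(an assumption, since there is more than one way to handle them). The
p-value helper uses a plain series for the incomplete gamma function;
scipy.stats.chi2.sf would do the same job.

```python
from math import comb, exp, lgamma, log

def shifted_binom_probs(n=9, p=0.5, shift=1):
    """P(length = k + shift) where k ~ binom(n, p)."""
    return {k + shift: comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(n + 1)}

def chi2_sf(x, df):
    """Upper tail 1 - chi2cdf(x, df), from the series expansion of the
    lower incomplete gamma function. Loses precision for tiny p-values."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    return max(0.0, 1.0 - exp(a * log(z) - z - lgamma(a) + log(total)))

def goodness_of_fit(counts, probs):
    """counts: {length: observed count}. Lengths that the model cannot
    produce are ignored here (an assumption of this sketch)."""
    lengths = sorted(probs)
    total = sum(counts.get(length, 0) for length in lengths)
    g = chi_sq = 0.0
    for length in lengths:
        obs = counts.get(length, 0)
        expected = total * probs[length]
        chi_sq += (obs - expected) ** 2 / expected
        if obs > 0:  # empty cells contribute nothing to G
            g += 2.0 * obs * log(obs / expected)
    df = len(lengths) - 1  # no parameters are estimated from the data
    return {"G": g, "chi2": chi_sq, "df": df,
            "p_G": chi2_sf(g, df), "p_chi2": chi2_sf(chi_sq, df)}
```

Calling goodness_of_fit(observed_counts, shifted_binom_probs()) returns
both statistics with their p-values for 9 degrees of freedom.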

Even worse: the observed lengths include some 11- and 12-letter words,
which are values that cannot be realized by the hypothesized
binomial distribution, but every such word occurs only once
in the word sample. These could easily be errors, so I
tried filtering out all words with only one occurrence in the sample.
This resulted in:

G-statistic = 319.51
chi-square = 317.42
df = 10
p-value < 10^-16 
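
The filtering step can be sketched as follows. This is only an
illustration: the function name and its parameters are hypothetical, a
word's length is taken to be the number of characters in whatever
transcription is used, and counting each distinct word once (rather
than once per occurrence) is an assumption that can be switched off.

```python
from collections import Counter

def length_histogram(tokens, min_occurrences=2, by_type=True):
    """Histogram of word lengths after dropping rare words.

    tokens: list of word strings from the sample.
    min_occurrences: words seen fewer times than this are discarded.
    by_type: if True, each distinct word counts once; otherwise each
             occurrence counts.
    """
    freq = Counter(tokens)
    hist = Counter()
    for word, n in freq.items():
        if n < min_occurrences:
            continue  # e.g. drop words that occur only once
        hist[len(word)] += 1 if by_type else n
    return dict(hist)

# Toy input, not real data:
sample = ["daiin", "daiin", "ol", "chedy", "ol", "qokeedy"]
print(length_histogram(sample))
```

The resulting histogram can then be compared against the expected
binomial frequencies.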

One last point. Stolfi gives example code that produces the binomial
word length distribution. It should be noted that the code is limited to
2^9=512 distinct values, whereas the VMS sample vocabulary contains 6525
words.
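
To make the 512 limit concrete, here is a small sketch of one
construction with that property. Stolfi's code is not reproduced in this
message, so the mapping below is an assumption made for illustration:
each 9-bit value becomes a word whose length is 1 plus the number of set
bits. Enumerating all values gives exactly the binom(9,0.5) shape
shifted by 1, but never more than 512 distinct words.

```python
from collections import Counter
from math import comb

N_BITS = 9  # assumed number of binary choices per word

# Length of the word for value v: 1 mandatory symbol plus one symbol
# per set bit (hypothetical mapping, for illustration only).
length_counts = Counter(1 + bin(v).count("1") for v in range(2 ** N_BITS))

for length in sorted(length_counts):
    # The count for each length equals the binomial coefficient C(9, length - 1).
    assert length_counts[length] == comb(N_BITS, length - 1)
    print(length, length_counts[length])

print("distinct words:", sum(length_counts.values()))
```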

I confess that I find these findings shocking, and I would rather believe
my eyes. Could anyone point out my mistake?
