
RE: VMs: excessive frequency of doubles...



Let me put it another way...

If you randomly scrambled the positions of the 37000 words in the
sample, then given the frequencies of the 50+ words in my table, you
would expect to find (on average) only 2 or 3 doubles of any of them.
There are actually 73. This is why it is significant.

Hence (IMHO) the process that generated each word was not
independent from the creation of neighbouring words.
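
The expectation under random scrambling can be checked with a short
calculation. The following is only a sketch under stated assumptions: the
text is taken to be a flat list of tokens, a "double" is any adjacent pair
of identical words (so a triple counts as two doubles), and the function
names are hypothetical, not from any existing tool. For a word occurring n
times among N tokens, the expected number of doubles after a uniform
shuffle is n(n-1)/N.

```python
import random

def expected_doubles(counts, total):
    """Expected number of adjacent identical pairs after a uniform shuffle.

    counts: mapping word -> number of occurrences (only the words of interest)
    total:  total number of tokens in the sample (N)
    Each word contributes n * (n - 1) / N.
    """
    return sum(n * (n - 1) / total for n in counts.values())

def count_doubles(tokens, words=None):
    """Count adjacent identical pairs, optionally restricted to `words`."""
    return sum(
        1
        for a, b in zip(tokens, tokens[1:])
        if a == b and (words is None or a in words)
    )

def simulate_doubles(tokens, words=None, trials=1000, seed=0):
    """Average number of doubles over `trials` random shuffles of the tokens."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    total = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += count_doubles(shuffled, words)
    return total / trials
```

Comparing count_doubles on the real token sequence with expected_doubles
(or with simulate_doubles as a cross-check) gives the kind of
observed-versus-expected comparison described above.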

Marke


-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On
Behalf Of elvogt@xxxxxxxxxxx
Sent: 18 August 2004 10:03
To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: excessive frequency of doubles...


Quoting Marke Fincher <markefincher@xxxxxxxxxxxxxxxxxxxxx>:

>
> In the figures below you can see that the actual number of doubled
> words is in many cases way beyond what you should expect if the
> words were created independently by a random process.
>

Uhm... yes, but this is to be expected.

It is only reasonable to expect that some events happen more frequently than
average, just as some events will be rarer than average.

All of your events are distinguished by a very low number of occurrences;
most higher-than-expected values are due to a single occurrence of a word
doubling, and the dramatically high values at the beginning of the table are
due to low overall frequencies of the words in question: the top 7 entries
of your list are held by words which occur fewer than six times in the VM,
and happen to be doubled exactly _once_ each. How statistically significant
is this?

To really get meaning out of your tables it'd be necessary to either check
also the other end of the spectrum (are there words which are doubled _less_
often than expected?), or limit yourself to frequent words.

Only when you have a lot of events do they begin to become statistically
meaningful. (One of my teachers used to say, "Statistics begins with 3.")
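
One rough way to put a number on "how significant" is a Poisson tail
probability: the chance of seeing at least k doubles when the expected
count is lam. This is only a sketch, and it assumes that the number of
doubles under random ordering is approximately Poisson-distributed, which
the thread does not establish; the function name poisson_tail is
hypothetical.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0.

    The tail is summed directly (in log space per term) rather than as
    1 - CDF, so that very small probabilities are not lost to rounding.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    i = k
    while True:
        # log of exp(-lam) * lam**i / i!
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mean the terms shrink; stop once they no longer matter.
        if i > lam and term <= total * 1e-17:
            break
        i += 1
    return min(total, 1.0)
```

For example, poisson_tail(73, 2.5) corresponds to the aggregate figures
quoted in the thread (73 doubles against 2 or 3 expected, with 2.5 used
here as an assumed midpoint), while poisson_tail(1, lam) with a small lam
corresponds to a rare word doubled exactly once. With many rare words in
one table, some single doublings are to be expected by chance, so the
per-word value needs to be read with that in mind.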

(Don't get me wrong: I also think that the VM is non-random text. It's just,
your numbers don't really support the non-random assumption. And actually
I'm quite surprised that the apparent word-doubling of the VM doesn't stand
out more prominently from the statistics.)

Cheers,

   E.


______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list
