
RE: Fw: Character n anomaly



[Note: I prefer to use the standard parsing nomenclature, where a
`token' is an occurrence of a `word'. So, for me, the sentence 
"the man can open the can" contains 6 tokens but only 4 words.]

    > [Nick Pelling:] If I was going to fake [the ultra-regular word
    > length] distribution (but instead peaking at, say, 10), I'd take
    > a pack of modern cards, throw out all the court cards, and,
    > every time I turned over an ace, insert a space. Once in a
    > while, I'd have to shuffle the deck: but basically that would be
    > it.
    >
    > But with average length 6, the easiest way would be to roll a
    > normal 6-sided dice: if it's a six, insert a space. How far off
    > is that from the observed distribution?

I am afraid that it won't do. With your method, the probability of a
random text token having k letters would be roughly p*(1-p)**(k-1)
where p is the probability of inserting a space (1/10 or 1/6 in your
examples).

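This is an exponentially decaying (geometric) distribution: one-letter
tokens would be the most common, and the tail would extend to
arbitrarily long tokens. That is quite different from the humped and
tail-less distribution we observe in the VMS.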
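
To make the shape concrete, here is a minimal Python sketch that simply
tabulates the formula above for the two values of p in the quoted
examples. The function name token_length_pmf and the cutoff of 12
letters are my own choices for illustration; the only assumption is
that a space follows each letter independently with probability p.

```python
def token_length_pmf(p, max_k):
    """P(token has k letters), k = 1..max_k, if a space follows
    each letter independently with probability p."""
    return [p * (1 - p) ** (k - 1) for k in range(1, max_k + 1)]


for p in (1 / 10, 1 / 6):
    pmf = token_length_pmf(p, 12)
    # The mean of this distribution is 1/p, so p = 1/6 does give an
    # average length of 6, but the shape has no hump.
    print(f"p = {p:.3f}  (mean token length = {1 / p:.1f})")
    for k, prob in enumerate(pmf, start=1):
        print(f"  {k:2d}  {prob:.3f}  {'#' * round(prob * 100)}")
```

Each row is smaller than the one before by the constant factor (1-p),
so the bar chart slopes steadily down from k = 1 instead of peaking
near the mean.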

As for the distribution of *word* lengths: I haven't done the math,
but I believe that, if spaces were inserted at random, we should see
many more distinct words than we actually do. For instance, every letter
sequence with 1 or 2 letters should occur in the VMS --- which is
clearly not the case.

One way to test "Nullspace" theories is to remove all spaces from the
VMS text, then re-insert them according to the proposed method. If the
theory is correct, the resulting text should have the same word
statistics and structure as the original. The above space-insertion
methods would definitely fail this test.
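
The test just described could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the
helper names (reinsert_spaces, length_histogram) are hypothetical, the
space-insertion rule is the dice method with p = 1/6, and the input is
a stand-in built from the example sentence in the note at the top, not
a VMS transcription.

```python
import random
from collections import Counter


def reinsert_spaces(text, p, rng):
    """Remove all spaces from text, then put a space after each
    letter independently with probability p. Returns the tokens."""
    letters = "".join(text.split())
    out = []
    for ch in letters:
        out.append(ch)
        if rng.random() < p:
            out.append(" ")
    return "".join(out).split()


def length_histogram(tokens):
    """Map token length -> number of tokens with that length."""
    return Counter(len(t) for t in tokens)


# Stand-in text only; a real test would load a VMS transcription here.
sample = "the man can open the can " * 200
original = sample.split()
resampled = reinsert_spaces(sample, 1 / 6, random.Random(0))

print("distinct words, original :", len(set(original)))
print("distinct words, resampled:", len(set(resampled)))
print("token lengths, original :", sorted(length_histogram(original).items()))
print("token lengths, resampled:", sorted(length_histogram(resampled).items()))
```

Comparing the two histograms and the two counts of distinct words is
the whole test: a space-insertion theory passes only if the resampled
figures look like the original ones.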

In fact, the symmetrical distribution of word lengths is only a small
part of the picture. That feature is clearly connected with the very
rigid internal structure of the VMS words --- which seems to be
utterly incompatible with the theory that spaces are inserted at
random.

All the best,

--stolfi