Re: On the word length distribution

To: Jacques Guy <jguy@xxxxxxxxxxxxxxxx>
Subject: Re: On the word length distribution
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Wed, 27 Dec 2000 21:16:55 -0200 (EDT)
Cc: voynich@xxxxxxxx
Delivered-to: reeds@research.att.com
In-reply-to: <3A493E0E.DE9EAA44@alphalink.com.au>
References: <200012261522.eBQFMHp00621@coruja.dcc.unicamp.br> <3A493E0E.DE9EAA44@alphalink.com.au>
Reply-to: stolfi@xxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx


    > [Jacques:] I wonder: what if there were no real
    > word separations; if the separations were a consequence
    > of rules like: "after y always a space, before q always
    > a space", etc. -- a system similar to Arabic? Actually,
    > in Arabic, there are word separations on top of these
    > "letter separations". He r e is an e xample of how E
    > nglish would look like with wor d se par ations and a
    > r ule "afte r e and r always a space".
    
If the letters are generated at random, this process should give an
exponentially decaying distribution of token lengths (i.e. tokens with
length k >= 1 would occur with probability A**(k-1)*(1 - A), for some
A < 1).
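
A minimal simulation sketch of the claim above, under stated assumptions:
letters are drawn independently and uniformly from a made-up 10-letter
alphabet, and the only splitting rule is "after e and r always a space".
The names token_lengths, ALPHABET and BREAK_AFTER are illustrative and
do not come from the original discussion. With lengths counted from 1,
the observed fraction of tokens of length k should come out close to
A**(k-1)*(1 - A), where 1 - A is the chance that a letter forces a space.

```python
import random
from collections import Counter

def token_lengths(n_letters, alphabet, break_after, seed=0):
    """Draw i.i.d. uniform letters and end a token after every letter
    in break_after. Returns a Counter mapping token length -> count."""
    rng = random.Random(seed)
    lengths = Counter()
    current = 0
    for _ in range(n_letters):
        letter = rng.choice(alphabet)
        current += 1
        if letter in break_after:
            lengths[current] += 1
            current = 0
    # A trailing unfinished token, if any, is simply dropped.
    return lengths

ALPHABET = "adehilnrst"    # hypothetical alphabet, an assumption
BREAK_AFTER = {"e", "r"}   # the example rule from the quoted message

counts = token_lengths(200_000, ALPHABET, BREAK_AFTER)
total = sum(counts.values())
A = 1 - len(BREAK_AFTER) / len(ALPHABET)  # chance a letter does not break

for k in range(1, 9):
    observed = counts[k] / total
    predicted = A ** (k - 1) * (1 - A)
    print(f"k={k}  observed={observed:.4f}  predicted={predicted:.4f}")
```

Adding a "before q always a space" rule makes neighbouring break
positions dependent on each other, so the lengths need no longer be
exactly geometric, although the tail should still decay exponentially.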

(I am not sure what the *word* length distribution would be -- I owe
you that.)

But the letters of the VMS are not generated at random; there are all
sorts of "phonological" rules about allowed digraphs, plus some rules
that apply to the word as a whole, like "there is at most one gallows
per word" and "each word is a single hill" (where the heights are
basically {q m n } < {d l r s} < {ch sh} < {k t p f}, plus some tweaks
for {e i a o y} and rare letters.)
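
The two whole-word rules can be sketched as a small checker. This is
only an illustration under assumptions: the height classes are the four
listed above, the letters {e i a o y} and any other glyph are simply
skipped (the "tweaks" are not specified here), "ch" and "sh" are read as
single glyphs, and the gallows are taken to be k, t, p, f. The function
names are hypothetical, and the sample strings are just illustrations.

```python
# Height classes from the text, lowest to highest.
LEVELS = [("q", "m", "n"), ("d", "l", "r", "s"), ("ch", "sh"), ("k", "t", "p", "f")]
HEIGHT = {glyph: level for level, glyphs in enumerate(LEVELS) for glyph in glyphs}

def glyph_heights(word):
    """Map a transliterated word to its list of heights.
    Glyphs without an assigned height are skipped (an assumption)."""
    heights = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in HEIGHT:              # "ch" or "sh"
            heights.append(HEIGHT[pair])
            i += 2
        else:
            if word[i] in HEIGHT:
                heights.append(HEIGHT[word[i]])
            i += 1
    return heights

def is_single_hill(heights):
    """True if the heights never rise again after they have fallen."""
    descending = False
    for prev, cur in zip(heights, heights[1:]):
        if cur < prev:
            descending = True
        elif cur > prev and descending:
            return False
    return True

def gallows_count(word):
    """Number of gallows letters (assumed to be k, t, p, f) in the word."""
    return sum(word.count(g) for g in "ktpf")

for w in ["qokeedy", "chedy", "dkdk"]:
    print(w, glyph_heights(w), is_single_hill(glyph_heights(w)), gallows_count(w))
```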

I don't know whether it is possible to get these two rules (and the
right statistics) with a low-order Markov monkey plus word-splitting
rules based on local context. We could get single-hill words by
inserting a space at every "valley bottom"; but I suspect that the
token length distribution would be too biased towards short words.
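
One way to try the "valley bottom" idea is sketched here, working on a
plain list of numeric heights. The exact cut position is an assumption:
the cut is placed just before the first rise that follows a descent, so
the bottom glyph stays with the preceding token. The uniform random
heights at the end stand in for a monkey and are not a model of the
VMS; split_at_valleys is a hypothetical name.

```python
import random
from collections import Counter

def split_at_valleys(heights):
    """Split a height sequence into single-hill tokens by cutting
    just before any rise that comes after a descent."""
    tokens = []
    current = []
    descending = False
    for h in heights:
        if current and h < current[-1]:
            descending = True
        elif current and h > current[-1] and descending:
            tokens.append(current)      # valley bottom reached: cut here
            current = []
            descending = False
        current.append(h)
    if current:
        tokens.append(current)
    return tokens

print(split_at_valleys([0, 3, 1, 2, 3, 0, 1]))

# Token-length histogram for uniformly random heights (an assumption).
rng = random.Random(0)
random_heights = [rng.randrange(4) for _ in range(10_000)]
length_counts = Counter(len(t) for t in split_at_valleys(random_heights))
for k in sorted(length_counts):
    print(k, length_counts[k])
```

Comparing such a histogram with the observed VMS token lengths would
show whether the suspected bias towards short words actually appears.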

--stolfi
