
Re: On the word length distribution




    > [Jacques:] I wonder: what if there were no real
    > word separations; if the separations were a consequence
    > of rules like: "after y always a space, before q always
    > a space", etc. -- a system similar to Arabic? Actually,
    > in Arabic, there are word separations on top of these
    > "letter separations". He r e is an e xample of how E
    > nglish would look like with wor d se par ations and a
    > r ule "afte r e and r always a space".
    
If the letters are generated independently at random, this process
should give an exponentially decaying distribution of token lengths
(i.e. tokens with length k >= 1 would occur with probability
A**(k-1)*(1 - A), for some A < 1).
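
A quick way to check this is a small simulation. The sketch below is
only an illustration under assumptions that are not in the message
above: letters are drawn independently and uniformly from a 26-letter
alphabet, and the only splitting rule is "after e and r always a
space". The function names are made up for the example. The expected
distribution is written in the form normalized over k >= 1, that is
A**(k-1)*(1 - A), which has the same exponential decay in k.

```python
import random
from collections import Counter


def token_lengths(n_letters, alphabet, break_after, seed=0):
    """Count token lengths when a space follows every letter in break_after."""
    rng = random.Random(seed)
    lengths = Counter()
    current = 0
    for _ in range(n_letters):
        letter = rng.choice(alphabet)
        current += 1
        if letter in break_after:
            # The breaking letter closes the token it belongs to.
            lengths[current] += 1
            current = 0
    return lengths


def geometric_pmf(k, a):
    """Probability of a token of length k >= 1 when a letter continues
    the token with probability a."""
    return a ** (k - 1) * (1 - a)


# Assumed setup: uniform letters, breaks after 'e' and 'r' only.
alphabet = "abcdefghijklmnopqrstuvwxyz"
break_after = {"e", "r"}
counts = token_lengths(200_000, alphabet, break_after)
total = sum(counts.values())
a = 1 - len(break_after) / len(alphabet)

# Observed frequency next to the predicted one, for short tokens.
for k in range(1, 6):
    print(k, round(counts[k] / total, 4), round(geometric_pmf(k, a), 4))
```

With real letter frequencies the value of A changes, but as long as
the letters are independent the shape stays geometric.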

(I am not sure what the *word* length distribution would be -- I owe
you that.)

But the letters of the VMS are not generated at random; there are all
sorts of "phonological" rules about allowed digraphs, plus some rules
that apply to the word as a whole, like "there is at most one gallows
per word" and "each word is a single hill" (where the heights are
basically {q m n} < {d l r s} < {ch sh} < {k t p f}, plus some tweaks
for {e i a o y} and rare letters.)
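
To make the two word-level rules concrete, here is a rough sketch of
how one might test a single word against them. It is not a definitive
statement of the rules: the tweaks for {e i a o y} and rare letters are
not spelled out above, so the sketch simply ignores those letters, and
it assumes that the gallows are exactly the top group {k t p f}. All
names in it are hypothetical.

```python
# Height classes from the message: {q m n} < {d l r s} < {ch sh} < {k t p f}.
GROUPS = [["q", "m", "n"], ["d", "l", "r", "s"], ["ch", "sh"], ["k", "t", "p", "f"]]
HEIGHT = {g: level for level, group in enumerate(GROUPS) for g in group}
GALLOWS = {"k", "t", "p", "f"}  # assumption: gallows = top height class


def glyphs(word):
    """Split a word into glyphs, treating 'ch' and 'sh' as single units."""
    out = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in ("ch", "sh"):
            out.append(word[i:i + 2])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def heights(word):
    """Heights of the glyphs that have one; other glyphs are skipped
    (assumption: the unspecified tweaks are ignored)."""
    return [HEIGHT[g] for g in glyphs(word) if g in HEIGHT]


def is_single_hill(word):
    """True if the heights rise (weakly) and then fall (weakly), once."""
    hs = heights(word)
    i = 0
    while i + 1 < len(hs) and hs[i] <= hs[i + 1]:
        i += 1
    while i + 1 < len(hs) and hs[i] >= hs[i + 1]:
        i += 1
    return i + 1 >= len(hs)


def gallows_count(word):
    """Number of gallows glyphs in the word."""
    return sum(1 for g in glyphs(word) if g in GALLOWS)


for w in ["qokedy", "chedy", "daiin", "kdk"]:
    print(w, heights(w), is_single_hill(w), gallows_count(w))
```

Note that the one-gallows rule is not implied by the hill rule in this
simplified form, since two adjacent gallows would form a plateau; that
is why the sketch checks them separately.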

I don't know whether it is possible to get these two rules (and the
right statistics) with a low-order Markov monkey plus word-splitting
rules based on local context. We could get single-hill words by
inserting a space at every "valley bottom"; but I suspect that the
token length distribution would be too biased towards short words.
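
The valley-bottom splitting could be sketched as follows. This is one
possible reading, not a worked-out proposal: it assumes the break goes
just before the first glyph that rises after a descent (so the bottom
glyph stays with the preceding token), it ignores glyphs without an
assigned height, and the function name is hypothetical. It says
nothing about whether the resulting length statistics match the VMS.

```python
# Height classes from the message: {q m n} < {d l r s} < {ch sh} < {k t p f}.
GROUPS = [["q", "m", "n"], ["d", "l", "r", "s"], ["ch", "sh"], ["k", "t", "p", "f"]]
HEIGHT = {g: level for level, group in enumerate(GROUPS) for g in group}


def split_at_valleys(glyph_seq, height=HEIGHT):
    """Split a glyph stream so that every token is a single hill.

    A new token starts whenever the height rises after having fallen
    within the current token. Glyphs without a height are attached to
    the current token and otherwise ignored (assumption).
    """
    tokens = []
    current = []
    last_h = None
    descending = False
    for g in glyph_seq:
        h = height.get(g)
        if h is not None and last_h is not None:
            if h < last_h:
                descending = True
            elif h > last_h and descending:
                # Rising again after a descent: close the token here.
                tokens.append("".join(current))
                current = []
                descending = False
        current.append(g)
        if h is not None:
            last_h = h
    if current:
        tokens.append("".join(current))
    return tokens


stream = ["q", "o", "k", "e", "d", "y", "k", "e", "d", "y"]
print(split_at_valleys(stream))
```

Feeding the output of a letter generator through such a function and
tabulating the token lengths would be one way to test the suspicion
about short words.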

--stolfi