Re: On the word length distribution
> [Jacques:] I wonder: what if there were no real
> word separations; if the separations were a consequence
> of rules like: "after y always a space, before q always
> a space", etc. -- a system similar to Arabic? Actually,
> in Arabic, there are word separations on top of these
> "letter separations". He r e is an e xample of how E
> nglish would look like with wor d se par ations and a
> r ule "afte r e and r always a space".
If the letters are generated at random, this process should give an
exponentially decaying distribution of token lengths (i.e. tokens with
length k >= 1 would occur with probability A**(k-1)*(1 - A), for some A < 1).
(I am not sure what the *word* length distribution would be -- I owe
you that.)
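
The exponential decay can be checked with a quick simulation. This is
only a sketch: it assumes a uniform 26-letter alphabet and Jacques's
example rule "after e and r always a space", and the name
token_lengths and its parameters are made up for illustration. With a
fraction p of the letters forcing a break, the decay factor is
A = 1 - p.

```python
import random
from collections import Counter

def token_lengths(n_letters, alphabet, break_after, seed=0):
    """Count token lengths when a space follows every letter in break_after."""
    rng = random.Random(seed)
    lengths = Counter()
    current = 0
    for _ in range(n_letters):
        letter = rng.choice(alphabet)   # letters drawn independently, uniformly
        current += 1
        if letter in break_after:       # rule: always a space after this letter
            lengths[current] += 1
            current = 0
    return lengths                      # a trailing unfinished token is dropped

# Assumed setup: plain Latin alphabet, break after "e" and "r".
alphabet = "abcdefghijklmnopqrstuvwxyz"
break_after = {"e", "r"}
counts = token_lengths(200_000, alphabet, break_after)
total = sum(counts.values())
p = len(break_after) / len(alphabet)    # chance that a letter ends the token

for k in range(1, 6):
    observed = counts[k] / total
    expected = (1 - p) ** (k - 1) * p   # geometric law with A = 1 - p
    print(k, round(observed, 4), round(expected, 4))
```

The observed and expected columns should agree up to sampling noise;
a non-uniform letter distribution only changes the value of p.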
But the letters of the VMS are not generated at random; there are all
sorts of "phonological" rules about allowed digraphs, plus some rules
that apply to the word as a whole, like "there is at most one gallows
per word" and "each word is a single hill" (where the heights are
basically {q m n} < {d l r s} < {ch sh} < {k t p f}, plus some tweaks
for {e i a o y} and rare letters).
I don't know whether it is possible to get these two rules (and the
right statistics) with a low-order Markov monkey plus word splitting
rules based on local context. We could get single-hill words by
inserting a space at every "valley bottom"; but I suspect that the
token length distribution would be too biased towards short words.
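
For anyone who wants to try the "valley bottom" splitting, here is a
rough sketch. The four height levels are the ones listed above;
everything else is an assumption of mine for the sake of the example:
letters without a listed height (e i a o y and the rare ones) are
simply carried along, the break is placed just before the first
letter that rises again after a descent, and the names glyphs_of and
split_at_valleys are hypothetical.

```python
# Height levels as listed above: {q m n} < {d l r s} < {ch sh} < {k t p f}.
HEIGHT = {}
for level, group in enumerate(["q m n", "d l r s", "ch sh", "k t p f"]):
    for glyph in group.split():
        HEIGHT[glyph] = level

def glyphs_of(text):
    """Split a string into glyphs, treating 'ch' and 'sh' as single units."""
    out, i = [], 0
    while i < len(text):
        if text[i:i + 2] in ("ch", "sh"):
            out.append(text[i:i + 2])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return out

def split_at_valleys(text):
    """Break an unspaced glyph stream so that each piece is a single hill."""
    words, current = [], ""
    prev = None        # height of the last ranked glyph seen
    falling = False    # True once the current piece has started to descend
    for g in glyphs_of(text):
        h = HEIGHT.get(g)          # None for unranked letters (assumption: ignore)
        if h is not None:
            if prev is not None:
                if h < prev:
                    falling = True
                elif h > prev and falling:
                    # Rising again after a descent: a valley bottom was passed.
                    words.append(current)
                    current = ""
                    falling = False
            prev = h
        current += g
    if current:
        words.append(current)
    return words

print(split_at_valleys("chedykshe"))
```

Running this over the output of a Markov generator and tallying the
lengths of the pieces would show whether the distribution is indeed
too biased towards short words.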
--stolfi