
- *To*: rene@xxxxxxxxxx
- *Subject*: RE: On the word length distribution
- *From*: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
- *Date*: Wed, 27 Dec 2000 19:30:27 -0200 (EDT)
- *Cc*: voynich@xxxxxxxx
- *Delivered-to*: reeds@research.att.com
- *In-reply-to*: <3A4A0DBA.BC28F9CE@voynich.nu>
- *References*: <3A4A0DBA.BC28F9CE@voynich.nu>
- *Reply-to*: stolfi@xxxxxxxxxxxxx
- *Sender*: jim@xxxxxxxxxxxxx

> I haven't been able to give this too much thought yet, but
> shouldn't there be some constraint on the probabilities
> in order to have the peak in the middle?

No, provided that all combinations do show up in the text (or, at
least, that they are sampled without bias with regard to length). The
probabilities will only affect the distribution of *token* lengths,
not that of *word* lengths.

> Also, I think that in (2) it should be allowed to have various
> different symbols in each slot (e.g. <empty>, Eva-ch or Eva-sh).

I almost fell for that too. But no, if you allow M alternatives for
each slot (besides <empty>), then you get more variety for longer
words than for shorter ones. The distribution will then be
binom(N,k)*(M**k), which is no longer symmetrical about N/2.

On the other hand, if we look at the *number* W_k (not the relative
frequency) of distinct words of length k, they fit the formula

    W_k \approx 12 * \choose(9,k-1)

where the factor "12" actually varies between about 11 and 13 for
lengths 2 through 8:

    length   W_k(observed)   \choose(9,k-1)   ratio
    ------   -------------   --------------   -----
       1            19               1         19.
       2           102               9         11.3
       3           413              36         11.5
       4          1058              84         12.6
       5          1651             126         13.1
       6          1654             126         13.1
       7          1007              84         12.0
       8           439              36         12.2
       9           138               9         15.3
      10            32               1         32.

Keep in mind that the counts are affected by noise, in both
directions (some valid words were lost before counting, and some
words that were counted are probably scribal or transcription
errors). In particular, the counts for large k include many "words"
which (from their internal structure) are almost surely two words run
together.

The simplest model that generates the theoretical distribution above
is to concatenate a `marker' symbol, chosen from among 12
possibilities, with nine other distinct symbols, each suppressed with
probability 1/2.

Of course, there are many other possibilities. For instance, instead
of the 12-fold choice, one could use a single marker, and add two
binary diacritics and a ternary one to specific symbols of the word,
by a deterministic rule.
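
The marker-plus-nine-slots model above can be checked by brute force. The following is only a minimal sketch: the symbol names in `MARKERS` and `SLOTS` are hypothetical placeholders (not EVA letters), a "word" is represented as a tuple of symbols, and the `OBSERVED` counts are copied from the table above. It also tabulates the binom(N,k)*(M**k) count mentioned earlier, to show that it stops being symmetrical once M > 1.

```python
from collections import Counter
from itertools import product
from math import comb

# Hypothetical placeholder symbols; only their number matters here.
MARKERS = [f"m{i}" for i in range(12)]   # the 12-fold marker choice
SLOTS = list("abcdefghi")                # nine distinct optional symbols


def enumerate_words(markers=MARKERS, slots=SLOTS):
    """Return every distinct word: one marker plus any subset of the slots."""
    words = set()
    for marker in markers:
        for keep in product((False, True), repeat=len(slots)):
            body = tuple(s for s, kept in zip(slots, keep) if kept)
            words.add((marker,) + body)
    return words


def distinct_words_by_length(words):
    """Count distinct words (not tokens) for each length in symbols."""
    return Counter(len(w) for w in words)


def variety_with_alternatives(n_slots, m):
    """Distinct fillings of n_slots when each filled slot has m alternatives."""
    return [comb(n_slots, k) * m**k for k in range(n_slots + 1)]


# Observed W_k, copied from the table in the message.
OBSERVED = {1: 19, 2: 102, 3: 413, 4: 1058, 5: 1651,
            6: 1654, 7: 1007, 8: 439, 9: 138, 10: 32}

counts = distinct_words_by_length(enumerate_words())
for k in range(1, 11):
    print(k, OBSERVED[k], counts[k], 12 * comb(9, k - 1))

print(variety_with_alternatives(9, 1))   # symmetrical about N/2
print(variety_with_alternatives(9, 2))   # skewed toward longer words
```

Because the sketch counts distinct words rather than tokens, the 1/2 suppression probability never enters: any probabilities that let every combination appear would give the same counts.
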
In fact, any deterministic length-preserving mapping of the above
"code" (the marker-plus-nine-slots scheme) will preserve the word
length distribution.

> Also, having fewer slots, where some can contain 0, 1 or
> 2 letters could result in a binomial distribution

That is true if the probabilities are 1:2:1, respectively. If the
three possibilities are equally likely, the distribution will not be
exactly binomial. However, the difference may not be visible in the
plot (that's what the central limit theorem is about).

> Indeed, I would not at all be surprised if the VMs contained
> nothing but numbers. Numbers would make a lot of sense
> for the labels near the zodiac nymphs, and these do fit in
> the standard word paradigms.
>
> Furthermore, having a binomial word length distribution
> but not a binomial token length distribution is completely
> logical if the text is a word for word encoding of some
> plaintext.

Yes. If the word codes are assigned sequentially as the words show up
in the text (or in some previous "practice" text), the most common
words will tend to get relatively shorter codes.

All the best,

--stolfi
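
As a rough check on the remark above about slots holding 0, 1 or 2 letters, the length distribution for several such slots can be computed by repeated convolution. This is only an illustrative sketch: the function names and the choice of five slots are assumptions made for the example, and the weights are relative, not normalized.

```python
from math import comb


def convolve(a, b):
    """Discrete convolution of two weight lists indexed by length."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def slot_length_weights(slot_weights, n_slots):
    """Relative weight of each total length for n_slots independent slots.

    slot_weights[j] is the relative weight of a slot holding j letters.
    """
    dist = [1]                      # zero slots: only length 0
    for _ in range(n_slots):
        dist = convolve(dist, slot_weights)
    return dist


n_slots = 5                          # arbitrary choice for the example
binomial = [comb(2 * n_slots, k) for k in range(2 * n_slots + 1)]

weights_121 = slot_length_weights([1, 2, 1], n_slots)
weights_111 = slot_length_weights([1, 1, 1], n_slots)

print(weights_121 == binomial)       # 1:2:1 slots act like pairs of binary slots
print(weights_111)                   # symmetrical and bell-shaped, not binomial
```

With 1:2:1 weights each slot behaves exactly like two binary slots, so the result is binomial; with equal weights the result is still symmetrical and peaked in the middle, which is why the difference may be hard to see in a plot.
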

**References**:

- **RE: On the word length distribution**
  - *From:* Rene Zandbergen
