RE: On the word length distribution
> I haven't been able to give this too much thought yet, but
> shouldn't there be some constraint on the probabilities
> in order to have the peak in the middle?
No, provided that all combinations do show up in the text (or, at
least, that they are sampled without bias with regard to length).
The probabilities will only affect the distribution of *token*
lengths, not that of *word* lengths.
> Also, I think that in (2) it should be allowed to have various
> different symbols in each slot (e.g. <empty>, Eva-ch or Eva-sh).
I almost fell for that too. But no, if you allow M alternatives for
each slot (besides <empty>), then you get more variety for longer
words than for shorter ones. The distribution will then be
binom(N,k)*(M**k) which is no longer symmetrical about N/2.
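
A quick numerical check of that asymmetry, as a minimal sketch. The
values N = 9 and M = 2 are illustrative choices made here, and the
function names are hypothetical.

```python
from math import comb

def distinct_words_by_length(n_slots, m_alternatives):
    """Count distinct words of each length k = 0..n_slots when every slot
    is either empty or holds one of m_alternatives symbols."""
    return [comb(n_slots, k) * m_alternatives**k for k in range(n_slots + 1)]

def is_symmetric(counts):
    """True if the counts read the same from both ends."""
    return counts == counts[::-1]

one_symbol = distinct_words_by_length(9, 1)   # plain binomial coefficients
two_symbols = distinct_words_by_length(9, 2)  # binom(9,k) * 2**k

print(one_symbol, is_symmetric(one_symbol))
print(two_symbols, is_symmetric(two_symbols))
print("peak at k =", two_symbols.index(max(two_symbols)))
```

With one symbol per slot the counts are symmetrical about N/2; with two
alternatives per slot the peak moves toward the longer words.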
On the other hand, if we look at the *number* W_k (not the relative
frequency) of distinct words of length k, they fit the formula
W_k \approx 12 * \choose(9,k-1)
where the factor "12" actually varies between 11 and 13 for k = 2 to 8:
length  W_k(observed)  \choose(9,k-1)  ratio
------  -------------  --------------  -----
     1             19               1   19.0
     2            102               9   11.3
     3            413              36   11.5
     4           1058              84   12.6
     5           1651             126   13.1
     6           1654             126   13.1
     7           1007              84   12.0
     8            439              36   12.2
     9            138               9   15.3
    10             32               1   32.0
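
The ratio column can be recomputed from the observed counts with a few
lines of Python. This is only a sketch: the counts are copied from the
table, and the names `observed` and `ratio_table` are made up for the
example.

```python
from math import comb

# Observed number of distinct words of each length, from the table.
observed = {1: 19, 2: 102, 3: 413, 4: 1058, 5: 1651,
            6: 1654, 7: 1007, 8: 439, 9: 138, 10: 32}

def ratio_table(counts, n_slots=9):
    """Return rows (k, W_k, choose(n_slots, k-1), W_k / choose)."""
    rows = []
    for k, w_k in sorted(counts.items()):
        c = comb(n_slots, k - 1)
        rows.append((k, w_k, c, w_k / c))
    return rows

for k, w_k, c, ratio in ratio_table(observed):
    print(f"{k:6d} {w_k:14d} {c:15d} {ratio:6.1f}")
```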
Keep in mind that the counts are affected by noise, in both directions
(some valid words were lost before counting, and some words that were
counted are probably scribal or transcription errors.) In particular,
the counts for large k include many "words" which (from their
internal structure) are almost surely two words run together.
The simplest model that generates the theoretical distribution above
is to concatenate a `marker' symbol, chosen from among 12
possibilities, with nine other distinct symbols, each suppressed with
probability 1/2.
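
That model is small enough to enumerate outright. The sketch below
lists every word it can produce and counts them by length; the marker
and symbol names are placeholders invented for the example, not actual
glyphs.

```python
from collections import Counter
from itertools import product
from math import comb

MARKERS = [f"m{i}" for i in range(12)]  # 12 placeholder markers
SYMBOLS = list("abcdefghi")             # 9 placeholder symbols, fixed order

def enumerate_words():
    """All distinct words: one marker followed by any subset of the
    nine symbols, kept in their fixed order."""
    words = set()
    for marker in MARKERS:
        for keep in product([False, True], repeat=len(SYMBOLS)):
            body = tuple(s for s, kept in zip(SYMBOLS, keep) if kept)
            words.add((marker,) + body)
    return words

length_counts = Counter(len(w) for w in enumerate_words())
for k in sorted(length_counts):
    print(k, length_counts[k], 12 * comb(9, k - 1))
```

Every subset is counted exactly once here, so the suppression
probability plays no part in the count of distinct words; it would
only matter for how often each word occurs as a token.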
Of course, there are many other possibilities. For instance, instead
of the 12-fold choice, one could use a single marker, and add two
binary diacritics and a ternary one to specific symbols of the word,
by a deterministic rule.
In fact, any deterministic, one-to-one, length-preserving mapping of
the above "code" will preserve the word length distribution.
> Also, having fewer slots, where some can contain 0, 1 or
> 2 letters could result in a binomial distribution
That is true if the probabilities of 0, 1, and 2 letters are in the
ratio 1:2:1, since each such slot then behaves like two independent
binary slots.
If the three possibilities are equally likely, the distribution
will not be exactly binomial. However, the difference may not
be visible in the plot (that's what the central limit theorem
is about).
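
The two cases can be compared numerically by convolving the per-slot
length distribution with itself. A minimal sketch, assuming five slots
(an arbitrary choice made for the example) and hypothetical helper
names:

```python
from math import comb

def convolve(p, q):
    """Distribution of the sum of two independent lengths."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def length_distribution(slot_probs, n_slots):
    """Distribution of total length over n_slots independent slots,
    where slot_probs[j] is the probability that a slot holds j letters."""
    dist = [1.0]
    for _ in range(n_slots):
        dist = convolve(dist, slot_probs)
    return dist

N_SLOTS = 5  # illustrative value, not taken from the text
binomial = [comb(2 * N_SLOTS, k) / 2 ** (2 * N_SLOTS)
            for k in range(2 * N_SLOTS + 1)]
weighted = length_distribution([0.25, 0.5, 0.25], N_SLOTS)  # 1:2:1
uniform = length_distribution([1 / 3, 1 / 3, 1 / 3], N_SLOTS)

for k in range(2 * N_SLOTS + 1):
    print(f"{k:2d} {binomial[k]:.4f} {weighted[k]:.4f} {uniform[k]:.4f}")
```

The 1:2:1 column matches the binomial one exactly. The equal-weights
column is still symmetrical and bell-shaped, but somewhat wider.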
> Indeed, I would not at all be surprised if the VMs contained
> nothing but numbers. Numbers would make a lot of sense
> for the labels near the zodiac nymphs, and these do fit in
> the standard word paradigms.
>
> Furthermore, having a binomial word length distribution
> but not a binomial token length distribution is completely
> logical if the text is a word for word encoding of some
> plaintext.
Yes. If the word codes are assigned sequentially as the words show up
in the text (or in some previous "practice" text), the most common
words will tend to get relatively shorter codes.
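
A toy simulation of that effect, as a sketch only. It assumes a
Zipf-like plaintext with an invented 1000-word vocabulary, and takes
lower code numbers to stand for shorter codes; none of these choices
comes from the manuscript.

```python
import random
from collections import Counter

def assign_codes_in_order(tokens):
    """Give each distinct word the next integer the first time it appears."""
    codes = {}
    for tok in tokens:
        if tok not in codes:
            codes[tok] = len(codes) + 1
    return codes

rng = random.Random(0)  # fixed seed so the run is repeatable
vocabulary = [f"w{i}" for i in range(1000)]     # made-up plaintext words
weights = [1.0 / (i + 1) for i in range(1000)]  # Zipf-like, an assumption
tokens = rng.choices(vocabulary, weights=weights, k=20000)

codes = assign_codes_in_order(tokens)
freq = Counter(tokens)
common = [w for w, _ in freq.most_common(20)]

mean_code_common = sum(codes[w] for w in common) / len(common)
mean_code_all = sum(codes.values()) / len(codes)
print(mean_code_common, mean_code_all)
```

The most frequent words end up with much lower code numbers on average
than the vocabulary as a whole, because they tend to appear early.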
All the best,
--stolfi