VMs: RE: Inks and retouching

```Is there an a priori argument for the binomial distribution? There are many
discrete distributions that will give a one-hump dist. with a longer tail on
the right than on the left: Negative binomial, Poisson, and others.  If the
letters making up the words also have a use distribution, then are we
looking at a conditional distro for the word length? Oh what a tangled web...
Don

-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On
Behalf Of Jorge Stolfi
Sent: Wednesday, July 21, 2004 4:54 PM
To: vms-list@xxxxxxxxxxx
Subject: VMs: Inks and retouching

> [Eric:] I also plotted word length and... I got a binomial plot
> for it???.

This is Eric's data:

00       0
01      31
02     168
03    1342
04    4719
05   10199
06   16818
07   21118
08   22302
09   20426
10   16409
11   11697
12    7566
13    4451
14    2342
15    1158
16     479
17     250
18      81
19      32
20      14
21       4
22       1
23       2

The distribution is "single-humped" but not binomial. Even without
plotting you can see that it falls off more slowly at the high
end than at the low end. (The peak is around 8; compare 3 with 13.)

Contrast it with the Voynichese curve, which is not only symmetrical
but matches C*choose(9,k-1) almost to the pixel.

Still, the near symmetry of the plot above is quite puzzling. You have
seen my plots: for English (as for many other languages) I get a much
longer tail, visibly a second hump.  (I don't get that tail with Quran
Arabic, or the Towneley Plays, or the Asian languages.)

I suspected that the second hump could be due to joined words, but the
source texts are of fairly good quality, and I spent many hours
cleaning them up - uniformizing punctuation, disambiguating the "." of
abbreviation, marking off foreign language bits, etc.

However the UWA list is very "dirty" - it has many foreign words and
proper names, acronyms, obscure words, etc. It is also very irregular
in its coverage of plurals and other derived words:

VOL            VOLAR        VOLATILIZE   VOLCANICITY  VOLCANOS
VOL-AU-VENT    VOLARY       VOLATIZE     VOLCANISM    VOLE
VOLAGE         VOLATIC      VOLBORTHITE  VOLCANIST    VOLENT
VOLANT         VOLATILE     VOLCAN       VOLCANIZE    VOLES
VOLANTE        VOLATILITIES VOLCANIAN    VOLCANO      VOLET

Note the lack of VOLATILES, VOLATIZED, VOLCANISTS, etc.

My guess is that the UWA list was in large part derived from dictionaries
rather than actual texts. A dictionary entry -- especially a minor one --
will typically list the root word but omit the regular derivatives,
so you will get VOLATILIZE but not VOLATILIZATION, which in actual
texts may be even more common than the verb itself.

All the best,

--stolfi

______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list


```
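
The asymmetry argument in the message above can be checked numerically. The following is only a minimal sketch: it takes Eric's counts exactly as quoted, computes the mode, mean, and skewness, and compares counts at equal distances on either side of the peak. For reference it does the same for the shape choose(9,k-1) with k = 1..10, which is the form quoted for the Voynichese curve. The helper name `moments` and the choice of moment-based skewness as the asymmetry measure are assumptions made for illustration and are not taken from the thread.

```python
from math import comb

# Eric's word-length counts as quoted in the message: length -> number of words.
counts = {
    0: 0, 1: 31, 2: 168, 3: 1342, 4: 4719, 5: 10199, 6: 16818, 7: 21118,
    8: 22302, 9: 20426, 10: 16409, 11: 11697, 12: 7566, 13: 4451, 14: 2342,
    15: 1158, 16: 479, 17: 250, 18: 81, 19: 32, 20: 14, 21: 4, 22: 1, 23: 2,
}


def moments(hist):
    """Return (mean, variance, skewness) of a histogram {value: weight}.

    Hypothetical helper; skewness is the third central moment divided by
    variance**1.5, so it is zero for a perfectly symmetric histogram.
    """
    total = sum(hist.values())
    mean = sum(k * n for k, n in hist.items()) / total
    var = sum(n * (k - mean) ** 2 for k, n in hist.items()) / total
    third = sum(n * (k - mean) ** 3 for k, n in hist.items()) / total
    return mean, var, third / var ** 1.5


mode = max(counts, key=counts.get)
mean, var, skew = moments(counts)
print(f"mode={mode} mean={mean:.3f} skewness={skew:.3f}")

# Compare counts at equal distances d on either side of the peak.
for d in range(1, 8):
    print(f"d={d}: left={counts[mode - d]:6d} right={counts[mode + d]:6d}")

# Reference shape: choose(9, k-1) for k = 1..10, symmetric about k = 5.5.
binom = {k: comb(9, k - 1) for k in range(1, 11)}
binom_mean, binom_var, binom_skew = moments(binom)
print(f"binomial reference: mean={binom_mean:.3f} skewness={binom_skew:.3f}")
```

A positive skewness for Eric's data, against a zero skewness for the choose(9,k-1) shape, is one way to put a number on the "falls off more slowly at the high end" observation. It does not say which alternative distribution (negative binomial, Poisson, or another) would fit better.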