VMs: RE: Inks and retouching
Is there an a priori argument for the binomial distribution? There are many
discrete distributions that will give a one-hump distribution with a longer
tail on the right than on the left: negative binomial, Poisson, and others.
If the letters making up the words also have a usage distribution, then are
we looking at a conditional distribution for the word length? Oh, what a
tangled web . . .
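To make the point concrete, here is a rough Python sketch comparing three
such families by mean, variance, and skewness. The parameters are arbitrary
illustrative choices, picked only so that all three have mean 4; nothing
here is fitted to the word-length data, and the infinite supports are
truncated at k = 100 as a numerical convenience.

```python
import math

def moments(pmf):
    """Return (mean, variance, skewness) of a dict {k: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    m3 = sum((k - mean) ** 3 * p for k, p in pmf.items())
    return mean, var, m3 / var ** 1.5

def binomial(n, p):
    # Number of successes in n independent trials.
    return {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

def poisson(lam, kmax=100):
    # Support truncated at kmax; the omitted tail is negligible for small lam.
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)}

def neg_binomial(r, p, kmax=100):
    # Number of failures before the r-th success; support truncated at kmax.
    return {k: math.comb(k + r - 1, k) * p**r * (1 - p)**k for k in range(kmax + 1)}

# Illustrative parameters only (assumed, not estimated from any text).
families = {
    "binomial(20, 0.2)": binomial(20, 0.2),
    "poisson(4)": poisson(4.0),
    "neg_binomial(4, 0.5)": neg_binomial(4, 0.5),
}

results = {name: moments(pmf) for name, pmf in families.items()}
for name, (mean, var, skew) in results.items():
    print(f"{name:22s} mean={mean:.2f} var/mean={var / mean:.2f} skew={skew:.2f}")
```

All three are one-humped with a longer right tail, but they differ in how
the variance compares with the mean: below it for the binomial, equal for
the Poisson, above it for the negative binomial. That ratio is one crude
way to tell the families apart.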
Don
-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On
Behalf Of Jorge Stolfi
Sent: Wednesday, July 21, 2004 4:54 PM
To: vms-list@xxxxxxxxxxx
Subject: VMs: Inks and retouching
> [Eric:] I also plotted word length and... I got a binomial plot
> for it???.
This is Eric's data:
00 0
01 31
02 168
03 1342
04 4719
05 10199
06 16818
07 21118
08 22302
09 20426
10 16409
11 11697
12 7566
13 4451
14 2342
15 1158
16 479
17 250
18 81
19 32
20 14
21 4
22 1
23 2
The distribution is "single-humped" but not binomial. Even without
plotting you can see that it falls off more slowly at the high
end than at the low end. (The peak is around 8; compare 3 with 13.)
Contrast it with the Voynichese curve, which is not only symmetrical
but matches C*choose(9,k-1) almost to the pixel.
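
The asymmetry can also be checked numerically. The Python sketch below
assumes Eric's counts are indexed by word length exactly as in the table
above, and sets C = 1, since C only rescales the choose(9,k-1) curve. The
Voynichese counts themselves are not reproduced here, so only the shape of
the model is shown.

```python
import math

# Eric's counts from the table above; list index = word length (0..23).
counts = [0, 31, 168, 1342, 4719, 10199, 16818, 21118, 22302, 20426,
          16409, 11697, 7566, 4451, 2342, 1158, 479, 250, 81, 32,
          14, 4, 1, 2]

total = sum(counts)
peak = max(range(len(counts)), key=lambda k: counts[k])
mean = sum(k * c for k, c in enumerate(counts)) / total
var = sum((k - mean) ** 2 * c for k, c in enumerate(counts)) / total
# Standardized third central moment: positive means a longer right tail.
skew = sum((k - mean) ** 3 * c for k, c in enumerate(counts)) / total / var ** 1.5

print(f"peak at length {peak}, mean {mean:.2f}, skewness {skew:.2f}")

# Compare counts at equal distances below and above the peak.
for d in range(1, peak + 1):
    print(f"{peak - d:2d}: {counts[peak - d]:6d}   {peak + d:2d}: {counts[peak + d]:6d}")

# Shape of the model C*choose(9,k-1) for k = 1..10, with C = 1 (an
# assumption made here; C only rescales the curve).
model = {k: math.comb(9, k - 1) for k in range(1, 11)}
print(model)
```

The model is exactly symmetric about k = 5.5, because choose(9,k-1) equals
choose(9,10-k), whereas Eric's counts are larger above the peak than at the
same distance below it once you move more than a couple of steps away.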
Still, the near symmetry of the plot above is quite puzzling. You have
seen my plots: for English (as for many other languages) I get a much
longer tail, visibly a second hump. (I don't get that tail with Quran
Arabic, or the Towneley Plays, or the Asian languages.)
I suspected that the second hump could be due to joined words, but the
source texts are of fairly good quality, and I spent many hours
cleaning them up - uniformizing punctuation, disambiguating the "." of
abbreviation, marking off foreign language bits, etc.
However the UWA list is very "dirty" - it has many foreign words and
proper names, acronyms, obscure words, etc. It is also very irregular
in its coverage of plurals and other derived words:
VOL          VOLAR         VOLATILIZE   VOLCANICITY  VOLCANOS
VOL-AU-VENT  VOLARY        VOLATIZE     VOLCANISM    VOLE
VOLAGE       VOLATIC       VOLBORTHITE  VOLCANIST    VOLENT
VOLANT       VOLATILE      VOLCAN       VOLCANIZE    VOLES
VOLANTE      VOLATILITIES  VOLCANIAN    VOLCANO      VOLET
VOLAPUK      VOLATILITY    VOLCANIC     VOLCANOES    VOLGOGRAD
Note the lack of VOLATILES, VOLATIZED, VOLCANISTS, etc.
My guess is that the UWA list was in large part derived from dictionaries
rather than actual texts. A dictionary entry -- especially a minor one --
will typically list the root word but omit the regular derivatives,
so you will get VOLATILIZE but not VOLATILIZATION, which in actual
texts may be even more common than the verb itself.
All the best,
--stolfi
______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list