VMs: Inks and retouching
> [Eric:] I also plotted word length and... I got a binomial plot
> for it???.
This is Eric's data:
The distribution is "single-humped" but not binomial. Even without
plotting you can see that it falls off more slowly at the high
end than at the low end. (The peak is around 8; compare 3 with 13.)
Contrast it with the Voynichese curve, which is not only symmetrical
but matches C*choose(9,k-1) almost to the pixel.
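
For anyone who wants to try the comparison on their own word list, here
is a rough sketch of the reference curve C*choose(9,k-1) together with a
simple symmetry check. The function names are made up for illustration,
the constant C is assumed to be chosen so that the curve sums to the
number of words, and "length" is simply the number of characters per
word; none of that is taken from the plots discussed here.

```python
from math import comb

def length_histogram(words):
    """Count how many words have each length (in characters)."""
    hist = {}
    for w in words:
        hist[len(w)] = hist.get(len(w), 0) + 1
    return hist

def binomial_reference(total, n=9, shift=1):
    """Reference counts C*choose(n, k-shift) for k = shift .. n+shift.

    Assumption: C = total / 2**n, so that the curve sums to `total`.
    """
    scale = total / 2 ** n
    return {k: scale * comb(n, k - shift) for k in range(shift, n + shift + 1)}

def skewness(hist):
    """Third standardized moment of a {length: count} histogram.

    Zero for a symmetric curve, positive when the high end has the
    longer tail.
    """
    total = sum(hist.values())
    mean = sum(k * c for k, c in hist.items()) / total
    var = sum(c * (k - mean) ** 2 for k, c in hist.items()) / total
    third = sum(c * (k - mean) ** 3 for k, c in hist.items()) / total
    return third / var ** 1.5
```

A histogram that falls off more slowly at the high end than at the low
end gives a clearly positive skewness, whereas the reference curve gives
zero, so comparing the two numbers is a quick test before plotting.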
Still, the near symmetry of the plot above is quite puzzling. You have
seen my plots: for English (as for many other languages) I get a much
longer tail, visibly a second hump. (I don't get that tail with Quran
Arabic, or the Towneley Plays, or the Asian languages.)
I suspected that the second hump could be due to joined words, but the
source texts are of fairly good quality, and I spent many hours
cleaning them up - uniformizing punctuation, disambiguating the "." of
abbreviation, marking off foreign language bits, etc.
However, the UWA list is very "dirty": it has many foreign words,
proper names, acronyms, obscure words, etc. It is also very irregular
in its coverage of plurals and other derived words:
  VOL          VOLAR         VOLATILIZE   VOLCANICITY  VOLCANOS
  VOL-AU-VENT  VOLARY        VOLATIZE     VOLCANISM    VOLE
  VOLAGE       VOLATIC       VOLBORTHITE  VOLCANIST    VOLENT
  VOLANT       VOLATILE      VOLCAN       VOLCANIZE    VOLES
  VOLANTE      VOLATILITIES  VOLCANIAN    VOLCANO      VOLET
  VOLAPUK      VOLATILITY    VOLCANIC     VOLCANOES    VOLGOGRAD
Note the lack of VOLATILES, VOLATIZED, VOLCANISTS, etc.
My guess is that the UWA list was in large part derived from dictionaries
rather than actual texts. A dictionary entry -- especially a minor one --
will typically list the root word but omit the regular derivatives,
so you will get VOLATILIZE but not VOLATILIZATION, which in actual
texts may be even more common than the verb itself.
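
The uneven coverage of derived forms can be checked mechanically. The
sketch below is only an illustration: the suffix rules are deliberately
naive (they will also propose forms that are not real words), the
function names are made up, and the sample is just the excerpt quoted
above rather than the full UWA list.

```python
# The excerpt of the word list quoted above.
UWA_SAMPLE = """
VOL VOLAR VOLATILIZE VOLCANICITY VOLCANOS
VOL-AU-VENT VOLARY VOLATIZE VOLCANISM VOLE
VOLAGE VOLATIC VOLBORTHITE VOLCANIST VOLENT
VOLANT VOLATILE VOLCAN VOLCANIZE VOLES
VOLANTE VOLATILITIES VOLCANIAN VOLCANO VOLET
VOLAPUK VOLATILITY VOLCANIC VOLCANOES VOLGOGRAD
""".split()

def naive_derivatives(word):
    """Guess a few regular derived forms (naive rules, an assumption)."""
    forms = []
    if word.endswith("IZE"):
        forms.append(word + "D")          # past tense of -IZE verbs
    if word.endswith("Y"):
        forms.append(word[:-1] + "IES")   # plural of words in -Y
    elif not word.endswith("S"):
        forms.append(word + "S")          # plain -S plural / 3rd person
    return forms

def missing_derivatives(wordlist):
    """Map each base word to the guessed forms absent from the list."""
    present = set(wordlist)
    gaps = {}
    for word in sorted(present):
        absent = [f for f in naive_derivatives(word) if f not in present]
        if absent:
            gaps[word] = absent
    return gaps

if __name__ == "__main__":
    for base, absent in missing_derivatives(UWA_SAMPLE).items():
        print(base, "->", ", ".join(absent))
```

On this excerpt the rules report VOLATILES, VOLATIZED and VOLCANISTS as
absent, while VOLE, VOLCANO and VOLATILITY are covered by VOLES,
VOLCANOS and VOLATILITIES. A list built from running text would be
expected to show far fewer such gaps for common words.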
All the best,