
VMs: RE: Inks and retouching



Is there an a priori argument for the binomial distribution? Many
discrete distributions give a one-hump shape with a longer tail on
the right than on the left: negative binomial, Poisson, and others. If the
letters making up the words also have a usage distribution, are we then
looking at a conditional distribution for the word length? Oh, what a
tangled web...
Don

-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On
Behalf Of Jorge Stolfi
Sent: Wednesday, July 21, 2004 4:54 PM
To: vms-list@xxxxxxxxxxx
Subject: VMs: Inks and retouching



  > [Eric:] I also plotted word length and... I got a binomial plot
  > for it???.

This is Eric's data:

  00       0
  01      31
  02     168
  03    1342
  04    4719
  05   10199
  06   16818
  07   21118
  08   22302
  09   20426
  10   16409
  11   11697
  12    7566
  13    4451
  14    2342
  15    1158
  16     479
  17     250
  18      81
  19      32
  20      14
  21       4
  22       1
  23       2

The distribution is "single-humped" but not binomial. Even without
plotting you can see that it falls off more slowly at the high
end than at the low end. (The peak is at 8; compare the counts at
lengths 3 and 13, both five steps from the peak: 1342 versus 4451.)
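
For anyone who wants to check the asymmetry numerically, here is a
minimal Python sketch. The counts are copied from the table above; the
function names are mine, and the skewness is the plain moment-based one,
chosen only for illustration. A positive value means a longer right tail.

```python
# Eric's word-length counts from the table above (list index = word length).
counts = [0, 31, 168, 1342, 4719, 10199, 16818, 21118, 22302, 20426,
          16409, 11697, 7566, 4451, 2342, 1158, 479, 250, 81, 32,
          14, 4, 1, 2]


def peak(counts):
    """Return the word length with the largest count."""
    return max(range(len(counts)), key=lambda k: counts[k])


def mirror_pairs(counts, max_d=7):
    """List (d, count at peak-d, count at peak+d) for d = 1..max_d."""
    p = peak(counts)
    return [(d, counts[p - d], counts[p + d])
            for d in range(1, max_d + 1)
            if p - d >= 0 and p + d < len(counts)]


def skewness(counts):
    """Moment-based skewness of the length distribution."""
    n = sum(counts)
    mean = sum(k * c for k, c in enumerate(counts)) / n
    m2 = sum(c * (k - mean) ** 2 for k, c in enumerate(counts)) / n
    m3 = sum(c * (k - mean) ** 3 for k, c in enumerate(counts)) / n
    return m3 / m2 ** 1.5


print("peak at length", peak(counts))
for d, left, right in mirror_pairs(counts):
    print(f"distance {d}: left {left:6d}  right {right:6d}")
print("skewness:", round(skewness(counts), 3))
```

The two sides are close near the peak and only diverge clearly from
about three steps out, which is why the plot looks almost symmetric at
first glance.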

Contrast it with the Voynichese curve, which is not only symmetrical
but matches C*choose(9,k-1) almost to the pixel.
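
To make the shape of that reference curve concrete, here is a small
Python sketch that tabulates choose(9,k-1) for k = 1..10. The scale
factor C is not given in this message, so it is left at 1 here; the
function name and its parameters are assumptions of mine.

```python
from math import comb


def binomial_shape(n=9, k_min=1, scale=1):
    """Return {k: scale * choose(n, k - k_min)} for k = k_min .. k_min + n."""
    return {k: scale * comb(n, k - k_min)
            for k in range(k_min, k_min + n + 1)}


shape = binomial_shape()
for k, v in shape.items():
    print(f"{k:2d} {v:4d}")

# The curve is a mirror image of itself around k = 5.5.
values = list(shape.values())
print("symmetric:", values == values[::-1])
```

The values run 1, 9, 36, 84, 126 and then back down in the same order,
so any real asymmetry in a word-length plot shows up at once against it.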

Still, the near symmetry of the plot above is quite puzzling. You have
seen my plots: for English (as for many other languages) I get a much
longer tail, visibly a second hump.  (I don't get that tail with Quran
Arabic, or the Towneley Plays, or the Asian languages.)

I suspected that the second hump could be due to joined words, but the
source texts are of fairly good quality, and I spent many hours
cleaning them up - uniformizing punctuation, disambiguating the "." of
abbreviation, marking off foreign language bits, etc.

However the UWA list is very "dirty" - it has many foreign words and
proper names, acronyms, obscure words, etc. It is also very irregular
in its coverage of plurals and other derived words:

  VOL            VOLAR        VOLATILIZE   VOLCANICITY  VOLCANOS
  VOL-AU-VENT    VOLARY       VOLATIZE     VOLCANISM    VOLE
  VOLAGE         VOLATIC      VOLBORTHITE  VOLCANIST    VOLENT
  VOLANT         VOLATILE     VOLCAN       VOLCANIZE    VOLES
  VOLANTE        VOLATILITIES VOLCANIAN    VOLCANO      VOLET
  VOLAPUK        VOLATILITY   VOLCANIC     VOLCANOES    VOLGOGRAD

Note the lack of VOLATILES, VOLATIZED, VOLCANISTS, etc.

My guess is that the UWA list was in large part derived from dictionaries
rather than actual texts. A dictionary entry -- especially a minor one --
will typically list the root word but omit the regular derivatives,
so you will get VOLATILIZE but not VOLATILIZATION, which in actual
texts may be even more common than the verb itself.

All the best,

--stolfi

______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list


