
Re: VMs: word database and binomial distribution

Jorge wrote:
> The distribution is "single-humped" but not binomial. Even without
> plotting you can see that it falls off more slowly at the high
> end than at the low end. (The peak is around 8; compare 3 with 13.)
>
> Contrast it with the Voynichese curve, which is not only symmetrical
> but matches C*choose(9,k-1) almost to the pixel.

I quickly fit the data with a binomial and, solely by
eyeball, saw a significantly better fit than what you
got with the English and Latin samples. While not as good
a fit as the VMS, it is far closer to the VMS than to the
English you plotted, which is what caught my eye.
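
A minimal Python sketch of that kind of comparison, for anyone who wants to repeat it. The function names are made up for illustration, and the only thing taken from the thread is the curve C*choose(9,k-1); here C is chosen so that the curve sums to the number of distinct words, which is an assumption about how the scaling was done.

```python
from math import comb

def length_histogram(words):
    """Count distinct words by length (duplicates are ignored)."""
    counts = {}
    for w in set(words):
        counts[len(w)] = counts.get(len(w), 0) + 1
    return counts

def binomial_curve(total, n=9, shift=1):
    """Expected count at length k: total * choose(n, k - shift) / 2**n.

    With n=9 and shift=1 this is C*choose(9, k-1) for k = 1..10,
    where C = total / 512.
    """
    return {k + shift: total * comb(n, k) / 2 ** n for k in range(n + 1)}

if __name__ == "__main__":
    # Toy word list, purely for demonstration.
    sample = ["a", "an", "the", "word", "words", "length", "lengths"]
    observed = length_histogram(sample)
    expected = binomial_curve(len(set(sample)))
    for k in sorted(set(observed) | set(expected)):
        print(k, observed.get(k, 0), round(expected.get(k, 0.0), 2))
```

Printing the two columns side by side is the numerical version of the eyeball comparison; it says nothing yet about how good the fit is.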

> My guess is that the UWA list was in large part derived from
> dictionaries rather than actual texts. A dictionary entry --
> especially a minor one -- will typically list the root word but
> omit the regular derivatives, so you will get VOLATILIZE but not
> VOLATILIZATION, which in actual texts may be even more common
> than the verb itself.

Yes, their source comes from a dictionary and
thesaurus, not actual text. Your word length
comparisons are of a lexicon built from actual text. A
dictionary is the same as using an infinitely large text
- every word appeared in an actual text somewhere
(even "foreign" words appear in a dictionary of the
"native" language because they are used like native
words in actual texts). Which with what Knox

--- Knox Mix <knoxmix@xxxxxxxxxxxxx> wrote:
> I have not tried to determine why this list gives a
> less extended right 
> leg than most of the meaningful documents I have
> looked at. The Towneley 
> Plays and Liber Salomonis are exceptional so perhaps
> the VMS is, also. 

This leads me to wonder: is it a matter of word
exhaustion (for lack of a better term)? Using a
dictionary for a source of words - which is like using
an infinite amount of actual text - we see a pretty
sharp curve. Slightly asymmetrical, but not much. We
are seeing a plot of every word and its every
variation in the language.

By "word exhaustion" I don't mean in a purely
combinatorial sense (since that would just keep going
up with # of words by word length) but as a natural
function of language.

Linguists out there - is there a known phenomenon
along this line?

Is that what we are seeing in the Towneley Plays?
These were plays for the "common folk", correct?
Presumably we would then have a much smaller language
to use and we would reach our word exhaustion sooner.

This seems like a fairly easy idea to get some legs
under (or pulled out from). Start generating these
word length distribution plots with a little bit of
actual text, then add more and more from different
sources and see if the curve sharpens. Pidgin and
creole languages would probably sharpen sooner
(anybody have a source for a large body of text?). As
soon as I find some time I'll work it through.
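
A rough sketch of how that experiment could be set up in Python. Everything here is an assumption for illustration: the texts arrive as plain-text chunks, words are split with a crude a-z tokenizer, and skewness of the distinct-word lengths stands in for how lopsided the curve is (zero for a symmetric curve, positive for a long right leg).

```python
import re

def skewness(values):
    """Third standardized moment; 0.0 for empty or constant input."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def growth_profile(chunks):
    """Add text chunk by chunk; after each one, record the vocabulary
    size and the skewness of the distinct-word length distribution."""
    vocab = set()
    profile = []
    for chunk in chunks:
        vocab.update(re.findall(r"[a-z]+", chunk.lower()))
        lengths = [len(w) for w in vocab]
        profile.append((len(vocab), skewness(lengths)))
    return profile

if __name__ == "__main__":
    # Toy chunks; real runs would read files from different sources.
    chunks = ["the cat sat on the mat", "a considerably longer vocabulary"]
    for size, skew in growth_profile(chunks):
        print(size, round(skew, 3))
```

If the word exhaustion idea holds, the skewness should drift toward zero and then level off as more text is added; if it keeps wandering, the idea has its legs pulled out.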

--- Knox Mix <knoxmix@xxxxxxxxxxxxx> wrote:
> However, I am not able to compare the curves
> mathematically. A true 
> binomial distribution of unique words could mean
> something entirely 
> different than an apparent but not binomial
> distribution.

Yes, agreed. Which raises a question: given the number of
possible corrections, transcription errors, etc., our plot
would not be perfectly binomial even if the source was. So
how close do we have to get before we say it is really
binomial?
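
One way to at least put a number on "how close" is sketched below in Python. It uses total variation distance between the observed length proportions and a shifted binomial, which is my choice of measure and not something anyone in the thread proposed; the function names and the default n=9, p=0.5, shift=1 are assumptions matching the C*choose(9,k-1) curve. It does not say where the cutoff should be, only how far off a given plot is.

```python
from math import comb

def binomial_pmf(n=9, p=0.5, shift=1):
    """Binomial(n, p) probabilities, relabelled so lengths run
    from shift to n + shift."""
    return {k + shift: comb(n, k) * p ** k * (1 - p) ** (n - k)
            for k in range(n + 1)}

def total_variation(observed_counts, model_pmf):
    """Half the summed absolute difference between observed proportions
    and model probabilities: 0 is a perfect match, 1 is no overlap."""
    total = sum(observed_counts.values())
    keys = set(observed_counts) | set(model_pmf)
    return 0.5 * sum(
        abs(observed_counts.get(k, 0) / total - model_pmf.get(k, 0.0))
        for k in keys
    )

if __name__ == "__main__":
    # Made-up counts by word length, for demonstration only.
    counts = {1: 2, 2: 10, 3: 35, 4: 80, 5: 130, 6: 125,
              7: 85, 8: 38, 9: 9, 10: 1}
    print(round(total_variation(counts, binomial_pmf()), 4))
```

Running the same measure on deliberately corrupted copies of a perfectly binomial list (dropped letters, merged words) would give a feel for how much distance transcription noise alone can produce.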

I think the plot Jorge found is very interesting and
something that needs explaining - I very much doubt
its structure is by chance... either a human-designed
construct (cipher, synthetic language) or it could be
a natural characteristic which has a significant
implication (such as the supposed word exhaustion
theory I just made up).

Finally, to GC's points - agreed on all.

