
Re: VMs: word database and binomial distribution



  
  > [Eric:] A dictionary is the same as using an infinitely large text
  > - every word appeared in an actual text somewhere
  
No, that is precisely my point: a dictionary will typically omit
derived words, even fairly common ones (like VOLATILIZATION), in
favor of the root forms (VOLATILIZE), even when those root forms are
less common in actual text than their derivatives.

I suspect that the "second hump" of most of my samples -- which
incidentally are all *literary* works -- comes precisely from long
derived words like "unsympathetic", "disillusionment", "periodically",
"intelligences" (all from the first couple of pages of "War of the
Worlds").  And that may also explain why the Towneley Plays gave
a single-humped (although still skewed) distribution: they were
meant to be recited, not read.

A more complicated explanation is needed for the single-humped
word-length distribution (WLD) of Arabic (Qur):

  http://www.dcc.unicamp.br/~stolfi/voynich/misc/wlds/langs-w-lengths-1-smit.png

The Quran was originally preserved in spoken form, and is still meant
to be recited, so, superficially, the same explanation given above for
the Towneley Plays would seem to apply here. However, this explanation
is too simplistic because the weak vowels *are* pronounced; so, to
make the comparison with the Plays reasonably fair, we should use
the vowelled version of the text (Qur-V) -- which has a two-humped
shape!

It is possible that some single words of my Qur-V file are written as
two separate words in the Qur file (like "disillusionment" vs.
"disillusion ment"); I have to check that. Another possibility, which
a linguist perhaps may clarify, is that the weak vowels in Arabic are
largely redundant and therefore should not be counted as letters in
this sort of analysis (just as we do not count the invisible "y" glide
in "uses").

  > Which raises the question of, given the number of possible
  > corrections, transcription errors, etc, even if the source was
  > perfectly binomial, our plot of it would not be, so how close do
  > we get before we say it is really binomial?

Good question. 

For starters, we could suppose that the original text had a perfectly
binomial WLD, with C*choose(9,k-1) distinct words of each length k =
1..10. That distribution would result, for
example, from a codebook cipher using an "anchor" symbol with C
possible values, and nine binary presence/absence digits (as described
on my web page); and a text that used the first C*2^9 words of the
codebook, and only those. Given the VMS statistics, we should assume
that the frequencies of these words follow a Zipf distribution (not
sure whether this is relevant), and that shorter words tend to occur
more frequently in the
text than longer words (see the so-called *token* length
distribution).
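
For concreteness, here is a minimal Python sketch of such a codebook.
The anchor and slot letters below are made up, just to have something
to count; only the counts matter:

  from itertools import product
  from math import comb

  # Hypothetical instance of the codebook: C anchor letters and nine
  # "slot" letters in a fixed order; each word is one anchor followed
  # by the slot letters whose presence bit is 1.
  C = 4
  ANCHORS = "aoey"       # the C possible anchor values (made up)
  SLOTS = "bcdfghklm"    # the nine presence/absence positions (made up)

  codebook = set()
  for a in ANCHORS:
      for bits in product((0, 1), repeat=9):
          codebook.add(a + "".join(s for s, b in zip(SLOTS, bits) if b))

  assert len(codebook) == C * 2**9    # 4 * 512 = 2048 distinct words

  # A word of length k is the anchor plus k-1 of the nine slots,
  # hence exactly C*choose(9,k-1) distinct words of each length:
  for k in range(1, 11):
      n_k = sum(1 for w in codebook if len(w) == k)
      assert n_k == C * comb(9, k - 1)
      print(k, n_k)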

Now suppose that the text is modified by loss of a few pages, and by
various reading errors. Only three types of error seem important:
mutation of a letter to produce another word of the same length,
splitting a token in two, and joining a token to the next one. We may
guess that some fraction (say 1/1000 or 1/100) of the *tokens* (not
words) are affected by each kind of error. Some of these errors will
affect the WLD, by creating a new "invalid" word or destroying all
instances of a "valid" word.
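
Here is a rough Python sketch of that error process; the per-token
rates are parameters, and page loss is not modeled:

  import random

  # Apply the three reading errors to a list of tokens.  Each token
  # suffers at most one error; p_mut, p_split, p_join are the
  # per-token rates (say 1/1000 or 1/100).
  def perturb(tokens, alphabet, p_mut=0.01, p_split=0.01, p_join=0.01,
              rng=random):
      out = []
      i = 0
      while i < len(tokens):
          t = tokens[i]
          r = rng.random()
          if r < p_mut:
              # Mutation: one letter replaced, length preserved.
              j = rng.randrange(len(t))
              out.append(t[:j] + rng.choice(alphabet) + t[j+1:])
          elif r < p_mut + p_split and len(t) > 1:
              # Split: the token breaks in two at a random point.
              j = rng.randrange(1, len(t))
              out.append(t[:j])
              out.append(t[j:])
          elif r < p_mut + p_split + p_join and i + 1 < len(tokens):
              # Join: the token merges with the next one.
              out.append(t + tokens[i + 1])
              i += 1
          else:
              out.append(t)
          i += 1
      return out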

So the first problem is to quantify the effect these perturbations
could have on an originally binomial WLD. One may estimate the effect
analytically, but it may be easier to do it by running a few
simulations. Either way we need to obtain the mean WLD of the
resulting texts -- i.e. the mean number n(k) of distinct words seen in
the output, for each length k. We also need the variance v(k) of that
number over several runs of the process.
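
With the perturb() sketch above, the estimation could go along these
lines:

  from statistics import mean, pvariance

  # Number of distinct words of each length k in a token list.
  def wld(tokens):
      d = {}
      for w in set(tokens):
          d[len(w)] = d.get(len(w), 0) + 1
      return d

  # Mean n(k) and variance v(k) of the distinct-word counts over
  # several independent runs of the error process.
  def wld_stats(tokens, alphabet, runs=100, **rates):
      samples = [wld(perturb(tokens, alphabet, **rates))
                 for _ in range(runs)]
      lengths = sorted({k for s in samples for k in s})
      n = {k: mean(s.get(k, 0) for s in samples) for k in lengths}
      v = {k: pvariance([s.get(k, 0) for s in samples], n[k])
           for k in lengths}
      return n, v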

The next step is to compare those numbers to the Voynichese WLD, w(k).
We may assume that the above process would result in a WLD whose
entries have a Gaussian distribution with mean n(k) and variance v(k).
We may also assume that the two hypotheses -- "it is a perfect
binomial C*choose(9,k-1) perturbed by errors" and "it is some random
single-humped distribution" (which still needs to be adequately
defined in statistical terms) -- are equally likely /a priori/. Then
we burn some
incense, and invoke the Bayes Oracle...
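
In code, the final comparison might look like the sketch below. Note
two extra assumptions: the counts for different lengths k are treated
as independent, and the alternative single-humped hypothesis is left
as a stub -- its log-likelihood must be supplied once that hypothesis
is properly defined:

  from math import log, pi

  # Log of the Gaussian density N(x; mu, var).
  def log_gauss(x, mu, var):
      return -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)

  # Posterior log-odds of "perturbed binomial" against the
  # alternative.  With equal priors the odds reduce to the
  # likelihood ratio.  w maps each length k to the Voynichese count
  # w(k); n and v come from wld_stats(); loglike_alt is the (still
  # undefined) alternative's log-likelihood for w.
  def log_odds(w, n, v, loglike_alt):
      ll = sum(log_gauss(w.get(k, 0), n[k], v[k])
               for k in n if v[k] > 0)   # skip degenerate v(k) = 0
      return ll - loglike_alt

A strongly positive result would mean the Oracle favors the
perturbed-binomial hypothesis.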

Any volunteers to carry out this analysis?  That could get the VMS
out of the "News" section of Nature and into the "Research" section...

All the best,

--stolfi