Re: VMS words and Roman numerals

    > [stolfi:] Indeed the Babylonians (and the Greek, Roman,
    > Chinese...) used a digit-position code: with different sets of
    > symbols for each position, omitting the zeros. Unfortunately,
    > all but the Romans had several choices per slot; so the word
    > length distribution for those numerals is not symmetrical.

    > [Rene:] Agreed, but since we know that not all possible words do
    > exist
Do we? We don't know what the "possible words" are. Perhaps we *do*
have 90% of them. If the "cipher" is indeed based on a codebook that was
built on the fly, then that is just what we would expect.

    > even in the list of words (as opposed to the list of
    > tokens) the probabilities 'per slot' could be unequal, i.e. for
    > example 0.5 that it's empty and the other 0.5 divided over
    > various options.

I don't follow.

Even if the list of words is incomplete, as long as the probability of
omitting a given word is independent of its length, the word length
distribution (WLD) should retain its original shape --- only the
multiplying factor should change.

On the other hand, if the probability of seeing a given word depends on
its length, then we would need an amazing coincidence to explain the
observed symmetry of the WLD.

So the only plausible explanation for the symmetry of the WLD, I
think, is that the number H_k of *possible* words of length k is
indeed close to C*binom(9,k-1), for some constant C.

Irrespective of the probabilities, if the set of alternatives for each
slot is not symmetrical with respect to length, then the WLD will not
be symmetrical. Suppose, for example, that we have three slots with
alternatives empty/A/a, empty/B/b, empty/C/c (i.e. a base-3 version of
the Greek number system). Then the ideal WLD will be 1:6:12:8. Even if
some fraction of the words were missed, we would still get a WLD with
roughly that shape.
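
The 1:6:12:8 figure can be checked by brute-force enumeration. This is only a minimal sketch: the variable names are illustrative, and it assumes every combination of slot choices counts as one possible word.

```python
from itertools import product
from collections import Counter

# Three slots, each either empty or filled with one of two letters
# (the base-3 analogue of the Greek number system described above).
slots = [["", "A", "a"], ["", "B", "b"], ["", "C", "c"]]

# Every possible word: one choice per slot, concatenated in slot order.
words = ["".join(choice) for choice in product(*slots)]

# Word length distribution: number of possible words of each length.
wld = Counter(len(w) for w in words)
counts = [wld[k] for k in range(4)]
print(counts)  # [1, 6, 12, 8]
```

The counts are binom(3,k) * 2**k: choose which k slots are filled, then one of two letters for each.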

To get a symmetric WLD with the model above, we would need to sample
the set of k-letter words with probability proportional to (1/2)**k.
That doesn't seem a likely scenario: as the text gets longer, the
probability of observing a valid word tends to 1. So we would need
another amazing coincidence between text length and letter
probabilities to explain a symmetric WLD.
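
As a quick arithmetic check of the (1/2)**k weighting, here is a small sketch. It assumes the ideal counts of the three-slot example, binom(3,k) * 2**k, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from math import comb

# Ideal counts for three empty/X/x slots: lengths 0..3 give 1, 6, 12, 8.
ideal = [comb(3, k) * 2**k for k in range(4)]

# Keep each k-letter word with probability proportional to (1/2)**k.
weighted = [Fraction(n, 2**k) for k, n in enumerate(ideal)]
print([int(x) for x in weighted])  # [1, 3, 3, 1]
```

The weighting cancels the factor 2**k exactly, leaving binom(3,k), which is symmetric.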

    > Lastly, the various not-quite-but-almost binomial distributions
    > shown in Jorge's posts mostly differ in the areas of extreme
    > word lengths. When plotted on a linear scale as in the figure on
    > Jorge's web page, the difference might not at all be noticeable....


The binomial distributions for N ~ 10 are already very close to
Gaussians, so the visual fit only gives two parameters: the mean word
length and the deviation. That is enough data to distinguish between
binom(11,k) and binom(9,k-1), say; but is not enough to distinguish
between a model with three 1:2:4:2:1 slots (the Roman numeral
distribution) and one with twelve 1:1 slots (which is binom(12,k)).
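
To put numbers on the two fitted parameters, the following sketch computes the mean and standard deviation of word length for the four models just named. The helper names (convolve, mean_sd) are hypothetical, and each model is treated as a plain count of possible words per length with all words equally likely, which is an assumption.

```python
from math import comb, sqrt

def convolve(p, q):
    """Length counts for the concatenation of two independent slots."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def mean_sd(counts, shortest=0):
    """Mean and standard deviation; counts[i] words have length shortest + i."""
    total = sum(counts)
    mean = sum((shortest + i) * n for i, n in enumerate(counts)) / total
    var = sum((shortest + i - mean) ** 2 * n for i, n in enumerate(counts)) / total
    return mean, sqrt(var)

roman_slot = [1, 2, 4, 2, 1]  # one Roman digit slot, lengths 0..4
models = {
    "binom(11,k)": ([comb(11, k) for k in range(12)], 0),
    "binom(9,k-1)": ([comb(9, j) for j in range(10)], 1),  # lengths 1..10
    "three 1:2:4:2:1 slots": (
        convolve(convolve(roman_slot, roman_slot), roman_slot), 0),
    "binom(12,k)": ([comb(12, k) for k in range(13)], 0),
}

stats = {name: mean_sd(c, s) for name, (c, s) in models.items()}
for name, (mean, sd) in stats.items():
    print(f"{name:24s} mean={mean:.2f} sd={sd:.2f}")
```

Under that assumption the first pair shares a mean of 5.5 and differs only in deviation (about 1.66 versus 1.50), while the second pair shares a mean of 6 and a length range of 0 to 12, with deviations of about 1.90 and 1.73.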

Perhaps we can get the needed extra information by looking closely at
the wings of the Voynichese WLD. Unfortunately, that part of the WLD
is the most sensitive to noise. For instance, the few 12-letter words
we find in the VMS are likely to be pairs of words that were
transcribed as one. So we can't even tell for sure what the maximum
length of a valid Voynichese word is.

    > Thus, the Roman number system may not be the only choice after all.

Not *the* Roman system exactly; but the statistics seem to be pointing
towards something of that sort (see my recent reply to John Grove).

    > The idea that the binomial word length distribution could be due to
    > the combination of a smaller number of (largely) independent
    > sub-groups with individual symmetric length distributions does of
    > course take us back to the prefix - stem - suffix construction.

Yes indeed.

Besides being so far the only historical example of a code with
near-binomial WLD, the Roman system is a very illuminating
example that explains why we haven't been able to identify
the "slots" of the VMS code:

  Imagine a Martian who is given a long random list of subtractive
  Roman numerals, between 1 and 999, somewhat noisy and incomplete;
  and is asked to figure out the code.

  He would quickly discover that the "words" have some kind of layer
  structure: D and C tend to occur near the beginning, X and L near
  the middle, I and V near the end. In particular, IC and VD will be
  absent, as this layer model predicts. 
  He would probably notice that the digits V, L, and D occur with
  probability close to 1/2 each, and are mutually independent; and
  that the WLD is surprisingly symmetrical, and almost (but definitely
  not quite) binomial. From that he would rightly deduce that the
  "code" is composed of some number of slots, each to be filled with 
  one element chosen from a length-symmetric set of alternatives.

  But then he would have a rather hard time figuring out what the
  slots and the corresponding element strings are. The obvious guess
  would be the substrings consisting of the characters {C,D}, {L,X},
  and {I,V}. However, that doesn't work --- for one thing, there are
  words like CXC and XIX where the subsets are interleaved.
  He may conjecture that the I is sometimes a pre-modifier and
  sometimes a post-modifier of V and X, but he would not know how to
  resolve the apparent ambiguity in XIV, nor why there are no
  instances of IXV or IXX or XIIV in the sample.
  It would take him a flash of genius to guess that IX, even though it
  contains an X, is actually a unit-slot element; and similarly that
  XC is a tens-slot element.
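
The statistics available to the Martian can be reproduced with a short sketch. It assumes a complete, noise-free list of the numerals 1 to 999 (the Martian's list is noisy and incomplete), and the table and function names are illustrative. Note that 900 is written CM, so the letter M also turns up in the hundreds slot.

```python
from collections import Counter

# One element per slot; index = decimal digit, "" = empty slot (digit 0).
UNITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]

def to_roman(n):
    """Subtractive Roman numeral for 1 <= n <= 999."""
    return HUNDREDS[n // 100] + TENS[n // 10 % 10] + UNITS[n % 10]

words = [to_roman(n) for n in range(1, 1000)]

# Word length distribution for lengths 1..12.
wld = Counter(len(w) for w in words)
print([wld[k] for k in range(1, 13)])
# [6, 24, 62, 123, 180, 208, 180, 123, 62, 24, 6, 1]

# Fraction of words containing V, L, D (each 500/999), and V together with L.
for letter in "VLD":
    print(letter, sum(letter in w for w in words) / len(words))
print("V&L", sum("V" in w and "L" in w for w in words) / len(words))

# Layer structure: IC and VD never occur, yet interleaved words do exist.
print(any("IC" in w or "VD" in w for w in words))  # False
print("CXC" in words, "XIX" in words)              # True True
```

The distribution is symmetric about length 6 except for the missing empty word (zero), and the joint frequency of V and L, 250/999, is close to the product of the two individual frequencies, as independence requires.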

Well, it seems that we are like that Martian. We already have lots of
tantalizing statistics and confusing hints. We have the
crust-core-mantle paradigm, which is essentially a six- or seven-slot
model, depending on how you count. However, we still do not know which
strings are valid fillers for each slot. The obvious
partition by letter class (dealers, gallows, etc.) seems to be
ambiguous. Many combinations seem to be mysteriously forbidden. The
rules governing the placement of <o>s and <e>s seem to be terribly
complicated. And so on.

What we need now, I feel, is only a small flash of genius...

All the best,