
*To*: rene@xxxxxxxxxx
*Subject*: Re: VMS words and Roman numerals
*From*: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
*Date*: Fri, 29 Dec 2000 01:24:53 -0200 (EDT)
*Cc*: voynich@xxxxxxxx
*Delivered-to*: reeds@research.att.com
*In-reply-to*: <3A4BC47E.EC075BF@voynich.nu>
*References*: <3A4A0DBA.BC28F9CE@voynich.nu> <200012271855.eBRIt0J00982@coruja.dcc.unicamp.br> <3A4A698E.4A60F050@voynich.nu> <200012280142.eBS1gHU01674@coruja.dcc.unicamp.br> <3A4BC47E.EC075BF@voynich.nu>
*Reply-to*: stolfi@xxxxxxxxxxxxx
*Sender*: jim@xxxxxxxxxxxxx

> [stolfi:] Indeed the Babylonians (and the Greek, Roman,
> Chinese...) used a digit-position code: with different sets of
> symbols for each position, omitting the zeros. Unfortunately,
> all but the Romans had several choices per slot; so the word
> length distribution for those numerals is not symmetrical.

> [Rene:] Agreed, but since we know that not all possible words do
> exist

Do we? We don't know what are the "possible words". Perhaps we *do* have 90% of them. If the "cipher" is indeed based on a codebook that was built on the fly, then that is just what we expect.

> even in the list of words (as opposed to the list of
> tokens) the probabilities 'per slot' could be unequal, i.e. for
> example 0.5 that it's empty and the other 0.5 divided over
> various options.

I don't follow. Even if the list of words is incomplete, as long as the probability of omitting a given word is independent of its length, the word length distribution (WLD) should retain its original shape --- only the multiplying factor should change.

On the other hand, if the probability of seeing a given word depends on its length, then we would need an amazing coincidence to explain the observed symmetry of the WLD.

So the only plausible explanation for the symmetry of the WLD, I think, is that the number H_k of *possible* words of length k is indeed close to C*binom(9,k-1), for some constant C.

Irrespective of the probabilities, if the set of alternatives for each slot is not symmetrical with respect to length, then the WLD will not be symmetrical. Suppose, for example, that we have three slots with alternatives empty/A/a, empty/B/b, empty/C/c (i.e. a base-3 version of the Greek number system). Then the ideal WLD will be 1:6:12:8. Even if some fraction of the words were missed, we would still get a WLD with roughly that shape.

To get a symmetric WLD with the model above, we would need to sample the set of k-letter words with probability proportional to (1/2)**k.
That doesn't seem a likely scenario: as the text gets longer, the probability of observing a valid word tends to 1. So we would need another amazing coincidence between text length and letter probabilities to explain a symmetric WLD.

> Lastly, the various not-quite-but-almost binomial distributions
> shown in Jorge's posts mostly differ in the areas of extreme
> word lengths. When plotted on a linear scale as in the figure on
> Jorge's web page, the difference might not at all be noticeable....

Right. The binomial distributions for N ~ 10 are already very close to Gaussians, so the visual fit only gives two parameters: the mean word length and the deviation. That is enough data to distinguish between binom(11,k) and binom(9,k-1), say; but is not enough to distinguish between a model with three 1:2:4:2:1 slots (the Roman numeral distribution) and one with twelve 1:1 slots (which is binom(12,k)).

Perhaps we can get the needed extra information by looking closely at the wings of the Voynichese WLD. Unfortunately, that part of the WLD is the most sensitive to noise. For instance, the few 12-letter words we find in the VMS are likely to be pairs of words that were transcribed as one. So we can't even tell for sure what is the maximum length of a valid Voynichese word.

> Thus, the Roman number system may not be the only choice after all.

Not *the* Roman system exactly; but the statistics seem to be pointing towards something of that sort (see my recent reply to John Grove).

> The idea that the binomial word length distribution could be due to
> the combination of a smaller number of (largely) independent
> sub-groups with individual symmetric length distributions does of
> course take us back to the prefix - stem - suffix construction.

Yes indeed.
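
The slot arithmetic used in this message can be checked numerically. The following is a minimal sketch that is not part of the original post. It assumes that slots are filled independently and that every alternative counts once, so the number of possible words of each length is the convolution of the per-slot length counts. The function name `wld` and the list encoding of a slot are illustrative choices.

```python
def wld(slots):
    """Count possible words by length for a list of independent slots.

    Each slot is a list c where c[j] is the number of alternatives of
    length j (c[0] counts the empty alternative).
    """
    total = [1]  # exactly one (empty) word before any slot is filled
    for counts in slots:
        out = [0] * (len(total) + len(counts) - 1)
        for i, a in enumerate(total):
            for j, b in enumerate(counts):
                out[i + j] += a * b  # lengths add, counts multiply
        total = out
    return total

# Base-3 "Greek-like" system: three slots, each with empty/A/a.
greek3 = wld([[1, 2]] * 3)
print(greek3)  # [1, 6, 12, 8]

# Weighting k-letter words by (1/2)**k gives a symmetric shape.
weighted = [n * 0.5 ** k for k, n in enumerate(greek3)]
print(weighted)  # [1.0, 3.0, 3.0, 1.0]

# Three Roman-style slots (1:2:4:2:1) versus twelve 1:1 slots.
roman3 = wld([[1, 2, 4, 2, 1]] * 3)
binom12 = wld([[1, 1]] * 12)
print(roman3)
print(binom12)
```

Under these assumptions both `roman3` and `binom12` cover lengths 0 to 12 and are symmetric about length 6. Note that the counts include the empty word, which corresponds to leaving every slot empty.
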

Besides being so far the only historical example of a code with near-binomial WLD, the Roman system is a very illuminating example that explains why we haven't been able to identify the "slots" of the VMS code:

Imagine a Martian who is given a long random list of subtractive Roman numerals, between 1 and 999, somewhat noisy and incomplete; and is asked to figure out the code.

He would quickly discover that the "words" have some kind of layer structure: D and C tend to occur near the beginning, X and L near the middle, I and V near the end. In particular, IC and VD will be absent, as this layer model predicts.

He would probably notice that the digits V, L, and D occur with probability close to 1/2 each, and are mutually independent; and that the WLD is surprisingly symmetrical, and almost (but definitely not quite) binomial. From that he would rightly deduce that the "code" is composed of some number of slots, each to be filled with one element chosen from a length-symmetric set of alternatives.

But then he would have a rather hard time in figuring out what are the slots and the corresponding element strings. The obvious guess would be the substrings consisting of the characters {C,D}, {L,X}, and {I,V}. However, that doesn't work --- for one thing, there are words like CXC and XIX where the subsets are interleaved. He may conjecture that the I is sometimes a pre-modifier and sometimes a post-modifier of V and X, but he would not know how to resolve the apparent ambiguity in XIV, nor why there are no instances of IXV or IXX or XIIV in the sample.

It would take him a flash of genius to guess that IX, even though it contains an X, is actually a unit-slot element; and similarly that XC is a tens-slot element.

Well, it seems that we are like that Martian. We already have lots of tantalizing statistics and confusing hints. We have the crust-core-mantle paradigm, which is essentially a six- or seven-slot model, depending on how you count.
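
The Martian's observations can be reproduced on the complete, noise-free list of numerals. The following is a rough sketch that is not part of the original post. It assumes the standard subtractive notation, in which 900 is written CM (so the letter M appears even though the range stops at 999). The table names and the helper `roman` are illustrative.

```python
from collections import Counter

# One table per slot; index 0 is the empty alternative.
UNITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]
TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]
HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]

def roman(n):
    """Subtractive Roman numeral for 1 <= n <= 999 (assumed notation)."""
    return HUNDREDS[n // 100] + TENS[n // 10 % 10] + UNITS[n % 10]

words = [roman(n) for n in range(1, 1000)]

# Word length distribution: number of numerals of each length.
length_counts = Counter(len(w) for w in words)
for k in sorted(length_counts):
    print(k, length_counts[k])

# Fraction of numerals containing each of V, L and D.
freq = {ch: sum(ch in w for w in words) / len(words) for ch in "VLD"}
print(freq)

# Layer structure: I never precedes C, and V never precedes D.
has_ic = any("IC" in w for w in words)
has_vd = any("VD" in w for w in words)
print(has_ic, has_vd)  # False False

# Interleaved letter classes: IX and XC each fill a single slot.
print(roman(19), roman(190))  # XIX CXC
```

The slot tables are exactly the information the Martian lacks: once IX is listed under units and XC under tens, words such as XIX and CXC split unambiguously.
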

However, we still do not know what are the valid strings that can be filled in each slot. The obvious partition by letter class (dealers, gallows, etc.) seems to be ambiguous. Many combinations seem to be mysteriously forbidden. The rules governing the placement of <o>s and <e>s seem to be terribly complicated. And so on.

What we need now, I feel, is only a small flash of genius...

All the best,

--stolfi

**Follow-Ups**:
- **Re: VMS words and Roman numerals** *From:* Gabriel Landini
- **Re: VMS words and Roman numerals** *From:* Rene Zandbergen

**References**:
- **RE: On the word length distribution** *From:* Rene Zandbergen
- **VMS words and Roman numerals** *From:* Jorge Stolfi
- **Re: VMS words and Roman numerals** *From:* Rene Zandbergen
