
To: "John Grove" <John@xxxxxxxxxxxx>
Subject: Re: On the word length distribution
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Thu, 28 Dec 2000 16:34:22 -0200 (EDT)
Cc: voynich@xxxxxxxx

> [John Grove:] I find it difficult to believe the author would
> have created such a cumbersome code as a 6000 word dictionary -
> what if he misplaced his dictionary?

That is indeed a problem with any codebook-based theory. As far as I know, such codes are useful only for short messages where 100% secrecy is absolutely essential. I still haven't heard of any pre-computer text as long as the VMS that was entirely encrypted by such a method.

The VMS (or what survives of it) contains about 35,000 tokens, i.e. about 150 tokens per page on average --- some 15-20 lines, I would guess. Now suppose that the author/reader needs to look up every other token in the dictionary (the rest having been memorized), and that each lookup takes 30 seconds. That comes out to about half an hour for each (very short) page. Phew!

On the other hand, a person who invents a wonderful new artificial language or encryption method may indeed be sufficiently motivated to undertake such a task, just as a demonstration. I suppose that Dalgarno, for instance, did write a complete dictionary for his artificial language --- which must have had at least 5000 words to be usable --- and must have composed some longish texts in it.

> Does this binomial system account for the lack of doublets?

The basic bit-position code (using a distinct symbol for each bit position) in fact ensures that there will be no repeated symbols within the same word, adjacent or not. And indeed the VMS words seem to obey this restriction to some extent. For instance, even though 50% of the tokens have a gallows letter, there are almost no tokens with two gallows (one would expect 25% of the tokens to have them). This may apply to other letters too; I'll check.

Now consider the modified bit-position encoding described in a previous message: with even digits on the left side of the #-marker, odd digits on the right side, both divided by 2 (truncating).
In this encoding, there are some repeated digits between the two halves, but none within each half. This example is meant to show that the binomial length distribution is compatible with a limited amount of letter repetition.

> I don't know what the math would do with such a system, but
> what if you created two 6x6 tables and filled them both with
> one 17 letter (plaintext) alphabet, the numbers 0-9, and the
> most frequently used letters filling in the remaining nine
> squares? The first table would be labelled according to the
> VMS character set found at the beginning of words, the second
> by word final characters. When to switch from the first table
> to the other thus bringing the encrypted word to a close could
> be random. Whenever the second table is used a space is added
> to the encryption. Am I right in assuming this would give you
> the binomial wordlength you've been looking at?

To account for the layer structure of words, you cannot use the same table twice in a row. You would need about six tables (dealer prefix, bench prefix, gallows, bench suffix, dealer suffix, final group), each mapping a letter to a string of zero or more VMS symbols of the proper type, and use each table once, in sequence. Each table must include the empty string as one of the codes. For instance, denoting the empty string by (), we could use

    Plaintext   A    B    C    D    E    F    G    H    ...
    Table 1     ()   o    qo   ol   qol  or   qor  ...
    Table 2     ()   ch   sh   che  she  cho  sho  ...
    Table 3     ()   k    t    ke   te   cth  ckh  cthe ckhe ...
    Table 4     ()   ch   sh   che  she  ...
    Table 5     ()   d    l    r    s    od   or   ol   ...
    Table 6     ()   y    oin  oiin oir  oiir am   ...

To account for the binomial wordlength distribution, in each table the number d_k of codes of each length k must be symmetrical, preferably bell-shaped. A symmetrical set for table 3 would be, for example, { () k t ke te ckh cth ckhe cthe ckhhe }, which has length distribution d_k = (1,2,2,2,2,1).
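A quick numerical check of that claim (a sketch of mine, using hypothetical stand-in tables whose length counts d_k are symmetric, not the actual code): concatenate one code from each of six tables and histogram the resulting word lengths. Since the word-length distribution is the convolution of the six per-table distributions, and a convolution of symmetric distributions is symmetric, the result comes out symmetric and bell-shaped.

```python
# Hypothetical tables with symmetric length distributions: five with
# d = (1,2,1) and one (the table-3 example from the text) with
# d = (1,2,2,2,2,1). These are illustrative stand-ins only.
from collections import Counter
from itertools import product

tables = [
    ["", "o", "l", "ol"],
    ["", "c", "s", "ch"],
    ["", "k", "t", "ke", "te", "ckh", "cth", "ckhe", "cthe", "ckhhe"],
    ["", "c", "s", "cs"],
    ["", "d", "l", "dy"],
    ["", "y", "o", "oy"],
]

# One code from each table, in sequence; word length = total code length.
hist = Counter(sum(map(len, combo)) for combo in product(*tables))
print([hist[k] for k in range(16)])

# The distribution is symmetric about its midpoint (here 15/2),
# hence roughly bell-shaped, as the binomial profile requires:
assert all(hist[k] == hist[15 - k] for k in range(16))
```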
I don't know whether it is possible to tweak all the tables (by playing with the <e> and <o> modifiers) to have symmetric distributions, and also get the correct mean and maximum word length.

However, the code as given above is inadequate because it assumes that all plaintext words have 6 letters. Moreover, the code is ambiguous: for instance, "ch" could be either ABAAAA or AAABAA. Finally, in some slots of the layered-structure model there don't seem to be enough possible codes to account for all letters of the alphabet. So the true encoding is probably more subtle than that...

> Well, I'm not a cryptologist but I would like to think that a
> cipher system sounds more logical than a codebook of 6000 words
> - and I still would rather it turned out to be a natural
> language.

Me too...

All the best,

--stolfi
