
Re: On the word length distribution



    > [John Grove:] I find it difficult to believe the author would
    > have created such a cumbersome code as a 6000 word dictionary -
    > what if he misplaced his dictionary?
    
That is indeed a problem with any codebook-based theory. As far as I
know, such codes are useful only for short messages where 100% secrecy
is absolutely essential. I still haven't heard of any pre-computer
text as long as the VMS that was entirely encrypted by such a method.

The VMS (or what survived of it) contains about 35,000 tokens, i.e.
about 150 tokens per page on the average --- some 15-20 lines, I would
guess. Now suppose that the author/reader needs to look up every other
token in the dictionary (the rest having been memorized), and that it
takes 30 seconds to do so. That comes out to over half an hour
for each (very short) page. Phew!
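
Spelling out that arithmetic, for anyone who wants to play with the
figures (a trivial Python sketch, using the round numbers assumed
above):

    # Back-of-the-envelope lookup cost per page, with the round figures
    # assumed above (all of them rough guesses, of course).
    tokens_per_page    = 150    # average tokens per page
    lookup_fraction    = 0.5    # every other token needs a lookup
    seconds_per_lookup = 30     # time for one dictionary lookup

    lookups_per_page = tokens_per_page * lookup_fraction           # 75
    minutes_per_page = lookups_per_page * seconds_per_lookup / 60  # 37.5
    print(lookups_per_page, minutes_per_page)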

On the other hand, a person who invents a new wonderful artificial
language or encryption method may indeed be sufficiently motivated to
undertake such a task, just as a demonstration. I suppose that
Dalgarno, for instance, did write a complete dictionary for his
artificial language --- which must have had at least 5000 words to be
usable --- and must have composed some longish texts in it.

    > Does this binomial system account for the lack of doublets?

The basic bit-position code (using a distinct symbol for each bit
position) in fact ensures that there will be no repeated symbols
within the same word, adjacent or not. And indeed the VMS words seem
to obey this restriction to some extent. For instance, even though 50%
of the tokens have a gallows letter, there are almost no tokens with
two gallows, whereas if gallows occurrences were independent one would
expect about 50% of 50%, i.e. 25%, of the tokens to have two. This
may apply to other letters too; I'll check.
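
For concreteness, here is a minimal sketch of the basic bit-position
code (in Python, with a made-up symbol alphabet); since each symbol
stands for one fixed bit position, no symbol can occur twice in a
code word:

    # Basic bit-position code: one distinct symbol per bit position,
    # so no code word can contain the same symbol twice.
    # The symbol alphabet below is made up, purely for illustration.

    SYMBOLS = "oydlrsk"            # symbol for bit positions 0, 1, 2, ...

    def encode(n):
        """List the symbols of the bits that are set in n."""
        word = ""
        pos = 0
        while n > 0:
            if n & 1:
                word = word + SYMBOLS[pos]
            n = n >> 1
            pos = pos + 1
        return word

    # Check: every encodable value uses each symbol at most once.
    for n in range(1, 2 ** len(SYMBOLS)):
        w = encode(n)
        assert len(set(w)) == len(w)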
 
Now consider the modified bit-position encoding described in a
previous message: with even digits on the left side of the #-marker,
odd digits on the right side, both divided by 2 (truncating). In this
encoding, there are some repeated digits between the two halves, but
none within each half. This example is meant to show that the binomial
length distribution is compatible with a limited amount of letter
repetition.
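
In code, my reading of that scheme is roughly the following (a sketch
reconstructed from the one-line summary above, so the details may not
match the earlier message exactly):

    # Modified bit-position encoding, as I read the summary above:
    # split the set bit positions by parity, divide each by 2
    # (truncating), and write the two groups around a "#" marker.

    def encode(n):
        positions = [i for i in range(n.bit_length()) if n & (1 << i)]
        left  = [str(p // 2) for p in positions if p % 2 == 0]   # even bits
        right = [str(p // 2) for p in positions if p % 2 == 1]   # odd bits
        return "".join(left) + "#" + "".join(right)

    # Example: bits {0,1,4,5}; even {0,4} -> "02", odd {1,5} -> "02",
    # giving "02#02": digits can repeat between the halves, never within one.
    print(encode(0b110011))        # prints 02#02

Within a half the digits are necessarily distinct, since distinct bit
positions of the same parity give distinct quotients when divided by 2.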

    > I don't know what the math would do with such a system, but what
    > if you created two 6x6 tables and filled them both with one 17
    > letter (plaintext) alphabet, the numbers 0-9, and the most
    > frequently used letters filling in the remaining nine squares?
    > The first table would be labelled according to the VMS character
    > set found at the beginning of words, the second by word final
    > characters. When to switch from the first table to the other
    > thus bringing the encrypted word to a close could be random.
    > Whenever the second table is used a space is added to the
    > encryption. Am I right in assuming this would give you the
    > binomial wordlength you've been looking at?
    
To account for the layer structure of words, you cannot use the same
table twice in a row. You would need about six tables (dealer prefix,
bench prefix, gallows, bench suffix, dealer suffix, final group), each
mapping a letter to a string of zero or more VMS symbols of the proper
type; and use each table once, in sequence. Each table must include
the empty string as one of the codes.

For instance, denoting the empty string by (), we could use

    Plaintext  A    B    C    D    E    F    G    H 
    Table 1    ()   o    qo   ol   qol  or   qor  ...
    Table 2    ()   ch   sh   che  she  cho  sho  ...
    Table 3    ()   k    t    ke   te   cth  ckh  cthe ckhe ...
    Table 4    ()   ch   sh   che  she  ...  
    Table 5    ()   d    l    r    s    od   or   ol ...
    Table 6    ()   y    oin  oiin oir  oiir am   ....
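
A sketch of the corresponding encoding step, using (a fragment of) the
tables above; the plaintext word and everything not shown in the
tables is of course made up:

    # Layered encoding with six tables, one per slot, applied in order.
    # Only a few letters per table are filled in, as in the example
    # above; everything else here is hypothetical.

    TABLES = [
        {"A": "", "B": "o",  "C": "qo",  "D": "ol",   "E": "qol"},  # dealer prefix
        {"A": "", "B": "ch", "C": "sh",  "D": "che",  "E": "she"},  # bench prefix
        {"A": "", "B": "k",  "C": "t",   "D": "ke",   "E": "te"},   # gallows
        {"A": "", "B": "ch", "C": "sh",  "D": "che",  "E": "she"},  # bench suffix
        {"A": "", "B": "d",  "C": "l",   "D": "r",    "E": "s"},    # dealer suffix
        {"A": "", "B": "y",  "C": "oin", "D": "oiin", "E": "oir"},  # final group
    ]

    def encode_word(plaintext):
        """Encode a six-letter plaintext word, one table per letter, in order."""
        assert len(plaintext) == len(TABLES)
        return "".join(t[c] for t, c in zip(TABLES, plaintext))

    print(encode_word("BABCAB"))   # "o"+""+"k"+"sh"+""+"y" = "okshy"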

To account for the binomial wordlength distribution, in each table the
numbers d_k of codes of each length k must form a symmetrical, preferably
bell-shaped sequence. Such a set for table 3 would be, for example,
{ () k t ke te ckh cth ckhe cthe ckhhe }, which has length distribution
d_k = (1,2,2,2,2,1). I don't know whether it is possible to tweak
all the tables (by playing with the <e> and <o> modifiers) to have
symmetric distributions, and also get the correct mean and maximum
word length.
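
To see why symmetric per-table distributions are what one wants:
assuming the plaintext letters are chosen independently and uniformly,
the length distribution of the whole code word is the convolution of
the six per-table distributions d_k, and convolution preserves
symmetry. A quick sketch with made-up d_k values:

    # Word-length distribution of the full code word = convolution of
    # the per-table code-length distributions d_k (assuming independent,
    # uniformly chosen plaintext letters).  The d_k below are made up.

    def convolve(p, q):
        out = [0] * (len(p) + len(q) - 1)
        for i in range(len(p)):
            for j in range(len(q)):
                out[i + j] += p[i] * q[j]
        return out

    tables = [
        [1, 2, 2, 2, 2, 1],   # e.g. the table-3 code set given above
        [1, 2, 1],
        [1, 3, 3, 1],
        [1, 2, 1],
        [1, 2, 2, 1],
        [1, 2, 1],
    ]

    total = [1]
    for d in tables:
        total = convolve(total, d)

    print(total)   # symmetric and bell-shaped, i.e. roughly binomial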

However, the code as given above is inadequate because it assumes that
all plaintext words have 6 letters. Moreover, the code is ambiguous:
for instance, "ch" could be either ABAAAA or AAABAA. Finally, in some
slots of the layered-structure model there don't seem to be enough
possible codes to account for all the letters of the alphabet. So the true
encoding is probably more subtle than that...

    > Well, I'm not a cryptologist but I would like to think that a
    > cipher system sounds more logical than a codebook of 6000 words
    > - and I still would rather it turned out to be a natural
    > language.

Me too...

All the best,

--stolfi