Re: On the word length distribution
Rene, here I will answer some of the less important points
you make in your message. But there is one really interesting
point, which I will save for a separate message.
> I'm with you now. By imposing that every possible combination
> exists, the probabilities (in the thesaurus, i.e. the list of
> words) are actually forced to 0.5.
Yes.
> That's also why you may only have two options, either empty or
> 'one specific character'.
To clarify: the characters must all be distinct, and
the rule "one character per slot" must be explicitly assumed.
For instance, suppose the characters are ABC (one per slot),
the possible words, sorted by length, are
k W_k words
- --- -------------------
1 1 #
2 3 #A #B #C
3 3 #AB #AC #BC
4 1 #ABC
If we allow two characters per slot, say [Aa][Bb][Cc],
then the possible words are
k W_k words
- --- -------------------
1 1 #
2 6 #A #a #B #b #C #c
3 12 #AB #Ab #aB #ab #AC #Ac ... #bc
4 8 #ABC #ABc #AbC ... #abc
Thus the symmetrical binomial distribution seems to require exactly
one choice per slot: with c choices per slot we get
W_k = choose(N,k-1) * c^(k-1), which is skewed towards the long
words whenever c > 1.
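For anyone who wants to check those counts, here is a quick Python
sketch (representing each slot by the string of its allowed
characters) that enumerates the words of the slot model and tallies
them by length:

    from itertools import combinations, product
    from collections import Counter

    def slot_words(slots):
        # A word is '#' plus one character from each slot of a chosen
        # subset, the slots kept in their fixed left-to-right order.
        return ['#' + ''.join(chars)
                for r in range(len(slots) + 1)
                for subset in combinations(slots, r)
                for chars in product(*subset)]

    def length_counts(slots):
        return Counter(len(w) for w in slot_words(slots))

    print(length_counts(['A', 'B', 'C']))     # lengths 1,2,3,4: 1,3,3,1 words
    print(length_counts(['Aa', 'Bb', 'Cc']))  # lengths 1,2,3,4: 1,6,12,8 words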
> I have been assuming that not all combinations do exist. This
> causes interesting problems.
And they do not; see for example the list of 2-character words
that I posted earlier today.
Note that the `bit position' encoding is rather redundant, because the
position symbols are all distinct and can be permuted without changing
the numerical value. It is not hard to remove the redundancy by a
length-preserving re-encoding, and this process may explain the
complicated structure of the VMS words.
For instance, take the variant of the bit-position code
which is described on my page, with even digits increasing
on the left, odd digits decreasing on the right:
Binary 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 ...
Code 24# 024# 24#1 024#1 4#5 04#5 4#51 04#51 4#53 04#53 ...
Now, since we know which digits are even and which are odd, we can
divide everything by 2, truncating:
Binary 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 ...
Code' 12# 012# 12#0 012#0 2#2 02#2 2#20 02#20 2#21 02#21 ...
This code now looks a bit more like VMS words: single-hill profiles, but the
same letters can occur on either side of the hilltop #. Note that this
re-encoding is one-to-one and preserves word length, so it still has a
binomial word-length distribution.
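In fact the halving step and its inverse take only a couple of lines
of Python; the inverse needs to know only that the digits left of the
# are the even ones and the digits right of it the odd ones (a
sketch, taking the Code strings above as given):

    def halve(code):
        # replace every digit d by d div 2; '#' stays put
        return ''.join(c if c == '#' else str(int(c) // 2) for c in code)

    def unhalve(code2):
        # invertible because parity is known from position:
        # left of '#' the digits were even (2*d), right of it odd (2*d+1)
        left, right = code2.split('#')
        return (''.join(str(2 * int(c)) for c in left) + '#' +
                ''.join(str(2 * int(c) + 1) for c in right))

    for code in ['24#', '024#', '24#1', '4#5', '4#53', '04#53']:
        assert unhalve(halve(code)) == code
        print(code, '->', halve(code))   # e.g. 4#53 -> 2#21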
(We still have some redundancy because the prefix and suffix are
monotonic. We could remove it by encoding only the differences, except
for the ends:
Binary 10100 10101 10110 10111 11000 11001 11010 11011 11100 11101 ...
Code'' 11# 011# 11#0 011#0 2#2 02#2 2#20 02#20 2#11 02#11 ...
However that would destroy the single-hill property.)
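The difference rule used above, for the record, is: keep the
outermost digit of each run absolute and replace the rest by
successive differences. In Python (a sketch):

    def diff_encode(code):
        # prefix (increasing): first digit, then the increments;
        # suffix (decreasing): the decrements, then the last digit
        left, right = code.split('#')
        l = [int(c) for c in left]
        r = [int(c) for c in right]
        l2 = l[:1] + [b - a for a, b in zip(l, l[1:])]
        r2 = [a - b for a, b in zip(r, r[1:])] + r[-1:]
        return ''.join(map(str, l2)) + '#' + ''.join(map(str, r2))

    for code in ['12#', '012#', '2#20', '2#21']:
        print(code, '->', diff_encode(code))   # e.g. 2#21 -> 2#11

Note how 12# becomes 11#: the digits are no longer strictly monotonic
on each side, which is what kills the single-hill shape.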
The point of this example is to show that even a simple re-encoding of
the bit-position code can thoroughly disguise the original N-slot
mechanism.
> Let's see. If one wants to encode a plaintext using a scheme along
> the lines you're suggesting, one could build a dictionary in
> advance or 'on the fly'. The latter would be easy in the computer
> age, rather hard before that.
Perhaps not that hard. I would maintain the dictionary both as a
code->word list (in numerical order, for decoding) and as a stack of
word->code library cards (in alphabetical order, for encoding).
Whenever I ran into a new word, I would add it to the list, assigning
to it the next code, and add the corresponding card to the stack.
Granted, this process would be rather slow; but that is a general
problem for any codebook-based theory.
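Concretely, the whole bookkeeping amounts to something like this (a
Python sketch; the clerk's cards become two lookup structures):

    class Codebook:
        def __init__(self):
            self.word_to_code = {}   # the alphabetized card stack (encoding)
            self.code_to_word = []   # the numbered list (decoding)

        def encode(self, word):
            if word not in self.word_to_code:      # new word: add a card,
                self.word_to_code[word] = len(self.code_to_word)
                self.code_to_word.append(word)     # assigning the next code
            return self.word_to_code[word]

        def decode(self, code):
            return self.code_to_word[code]

    book = Codebook()
    codes = [book.encode(w) for w in 'the cat saw the dog'.split()]
    print(codes)                            # [0, 1, 2, 0, 3]
    print([book.decode(c) for c in codes])  # ['the', 'cat', 'saw', 'the', 'dog']

(The code numbers would then be spelled out with the bit-position
scheme, or whatever encoding one prefers.)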
> I suppose that in practice it would be a combination: make a
> good starting dictionary and then add words on the fly as they
> are needed.
Perhaps. The token length distribution suggests that the most common
words were assigned to short codes. That could be the result of either
strategy.
> In any case, the source text would have far more than 512
> different words.
Note that the word counts for each length are actually about 12 times
those of the pure binomial model; so the dictionary in fact contains
about 12 * 512 ~ 6000 distinct words.
> If we take this further, there should be a set of 12 characters
> (nice number!) of which every Voynich word should have:
> - exactly one (simple scenario)
> - at least one (complicated scenario)
Unfortunately that is not the case. It looks more like some
variant of the "diacritics" solution. E.g., start with the
basic bit position code
# #A #B #AB #C #AC #BC #ABC #D #AD ...
then vary the first letter (only!) between upper or lower case:
# #A #a #B #b #AB #aB #C #c #AC #aC #BC #bC #ABC #aBC #D #d ...
This results in a word length distribution that is exactly twice the
pure binomial, W_k = 2*choose(N,k-1) (except for W_1, the single code #).
I suspect that this sort of trick was used in the VMS, only
with more bits -- so that the factor came out 12 instead of 2.
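To check the doubling, here is a sketch with N = 4 letters ABCD:

    from itertools import combinations
    from collections import Counter
    from math import comb

    N, letters = 4, 'ABCD'

    words = ['#']
    for r in range(1, N + 1):
        for combo in combinations(letters, r):
            s = ''.join(combo)
            words.append('#' + s)                      # first letter upper case
            words.append('#' + s[0].lower() + s[1:])   # first letter lower case

    W = Counter(len(w) for w in words)
    for k in sorted(W):
        print(k, W[k], 1 if k == 1 else 2 * comb(N, k - 1))
        # W_k agrees with 2*choose(N,k-1) for every k >= 2

W_1 stays at 1, of course, since # has no letter to vary.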
All the best,
--stolfi