
To: rene@xxxxxxxxxx
Subject: Re: On the word length distribution
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Wed, 27 Dec 2000 22:57:32 -0200 (EDT)
Cc: voynich@xxxxxxxx
Delivered-to: reeds@research.att.com
In-reply-to: <3A4A698E.4A60F050@voynich.nu>
References: <3A4A0DBA.BC28F9CE@voynich.nu> <200012271855.eBRIt0J00982@coruja.dcc.unicamp.br> <3A4A698E.4A60F050@voynich.nu>
Reply-to: stolfi@xxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx

Rene, here I will answer some of the less important points you make
in your message. But there is one really interesting point which I
will save for a separate message.

  > I'm with you now. By imposing that every possible combination
  > exists, the probabilities (in the thesaurus, i.e. the list of
  > words) are actually forced to 0.5.

Yes.

  > That's also why you may only have two options, either empty or
  > 'one specific character'.

To clarify: the characters must be all distinct, and the rule "one
character per slot" must be explicitly assumed. For instance, suppose
the characters are ABC (one per slot); then the possible words, sorted
by length, are

    k  W_k  words
    -  ---  -------------------
    1    1  #
    2    3  #A #B #C
    3    3  #AB #AC #BC
    4    1  #ABC

If we allow two characters per slot, say [Aa][Bb][Cc], then the
possible words are

    k  W_k  words
    -  ---  -------------------
    1    1  #
    2    6  #A #a #B #b #C #c
    3   12  #AB #Ab #aB #ab #AC #Ac ... #cb
    4    8  #ABC #ABc #AbC ... #abc

Thus the symmetrical binomial distribution seems to require only one
choice per slot.

  > I have been assuming that not all combinations do exist. This
  > causes interesting problems.

And they do not; see for example the list of 2-character words that I
posted earlier today.

Note that the `bit position' encoding is rather redundant, because the
position symbols are all distinct and can be permuted without changing
the numerical value. It is not hard to remove the redundancy by a
length-preserving re-encoding, and this process may explain the
complicated structure of the VMS words.

For instance, take the variant of the bit-position code which is
described in my page, with even digits increasing on the left, odd
digits decreasing on the right:

    Binary  10100  10101  10110  10111  11000  11001  11010  11011  11100  11101 ...
    Code      24#   024#   24#1  024#1    4#5   04#5   4#51  04#51   4#53  04#53 ...

Now, since we know which digits are even and which are odd, we can
divide everything by 2, truncating:

    Binary  10100  10101  10110  10111  11000  11001  11010  11011  11100  11101 ...
    Code'     12#   012#   12#0  012#0    2#2   02#2   2#20  02#20   2#21  02#21 ...

This code now looks a bit more like VMS words: single-hill profiles,
but the same letters can occur on either side of the hilltop #. Note
that this re-encoding is one-to-one and preserves word length, so it
still has a binomial word-length distribution.

(We still have some redundancy because the prefix and suffix are
monotonic. We could remove it by encoding only the differences, except
for the ends:

    Binary  10100  10101  10110  10111  11000  11001  11010  11011  11100  11101 ...
    Code''    11#   011#   11#0  011#0    2#2   02#2   2#20  02#20   2#11  02#11 ...

However that would destroy the single-hill property.)

The point of this example is to show that even a simple re-encoding of
the bit-position code can thoroughly disguise the original N-slot
mechanism.

  > Let's see. If one wants to encode a plaintext using a scheme along
  > the lines you're suggesting, one could build a dictionary in
  > advance or 'on the fly'. The latter would be easy in the computer
  > age, rather hard before that.

Perhaps not that hard. I would maintain the dictionary as a code->word
list (in numerical order, for decoding) and also as a stack of
word-code library cards (in alphabetical order, for encoding).
Whenever I ran into a new word, I would add it to the list, assigning
to it the next code, and add the corresponding card to the stack.
Granted, this process would be rather slow; but that is a general
problem for any codebook-based theory.

  > I suppose that in practice it would be a combination: make a
  > good starting dictionary and then add words on the fly as they
  > are needed.

Perhaps. The token length distribution suggests that the most common
words were assigned to short codes. That could be the result of either
strategy.

  > In any case, the source text would have far more than 512
  > different words.

Note that the word counts for each length are actually about 12 times
the pure binomial model, so the dictionary in fact contains about 6000
distinct words.
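As a quick illustration (not part of the original message), the
one-character-per-slot rule and its binomial consequence can be
checked with a short Python sketch. The symbol sets and slot layout
below are just the ABC / [Aa][Bb][Cc] toy examples from the tables
above, not anything taken from the VMS itself:

```python
from collections import Counter
from itertools import combinations, product
from math import comb

def one_char_per_slot(symbols):
    """Basic bit-position code: each word is '#' plus the symbols of
    the slots that are 'on', one distinct character per slot."""
    return ["#" + "".join(c)
            for k in range(len(symbols) + 1)
            for c in combinations(symbols, k)]

def two_chars_per_slot(slots):
    """Variant where each slot is empty or contributes one of its two
    characters, e.g. slots = ['Aa', 'Bb', 'Cc']."""
    return ["#" + "".join(choice)
            for choice in product(*[("",) + tuple(s) for s in slots])]

# One choice per slot: word-length counts follow C(N, k-1).
W1 = Counter(len(w) for w in one_char_per_slot("ABC"))
print(dict(W1))  # {1: 1, 2: 3, 3: 3, 4: 1}
assert all(W1[k] == comb(3, k - 1) for k in W1)

# Two choices per slot: the symmetry is destroyed.
W2 = Counter(len(w) for w in two_chars_per_slot(["Aa", "Bb", "Cc"]))
print(dict(W2))  # {1: 1, 2: 6, 3: 12, 4: 8}
```

The second count, 1-6-12-8, matches the table above and is visibly
skewed toward long words, which is why the symmetrical binomial shape
seems to force exactly one character per slot.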
  > If we take this further, there should be a set of 12 characters
  > (nice number!) of which every Voynich word should have:
  >  - exactly one (simple scenario)
  >  - at least one (complicated scenario)

Unfortunately that is not the case. It looks more like some variant of
the "diacritics" solution. E.g., start with the basic bit-position
code

    # #A #B #AB #C #AC #BC #ABC #D #AD ...

and then vary the first letter (only!) between upper and lower case:

    # #A #a #B #b #AB #aB #C #c #AC #aC #BC #bC #ABC #aBC #D #d ...

This results in a word-length distribution that is exactly twice the
pure binomial, W_k = 2*choose(N,k-1) (except for W_1, the single code
#). I suspect that this sort of trick was used in the VMS, only with
more bits -- so that the factor came out 12 instead of 2.

All the best,

--stolfi
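The doubling produced by the first-letter case trick can also be
verified directly. The sketch below (mine, not from the original
message) uses a hypothetical 4-slot alphabet ABCD just to keep the
enumeration small:

```python
from collections import Counter
from itertools import combinations
from math import comb

def base_code(symbols):
    """Basic bit-position code over distinct symbols, one per slot."""
    return ["#" + "".join(c)
            for k in range(len(symbols) + 1)
            for c in combinations(symbols, k)]

def with_case_variants(words):
    """'Diacritics' variant: vary only the first letter after '#'
    between upper and lower case. The bare code '#' has no letter to
    vary, so it stays unique."""
    out = []
    for w in words:
        out.append(w)
        if len(w) > 1:
            out.append(w[0] + w[1].lower() + w[2:])
    return out

N = 4
words = with_case_variants(base_code("ABCD"))
W = Counter(len(w) for w in words)
for k in sorted(W):
    # Exactly twice the pure binomial, except for the single code '#'.
    expected = comb(N, k - 1) * (1 if k == 1 else 2)
    assert W[k] == expected
print(dict(W))  # {1: 1, 2: 8, 3: 12, 4: 8, 5: 2}
```

Since the case variant changes neither the word length nor the number
of slots, the distribution keeps its binomial shape while the counts
double; adding more such binary "diacritic" choices would multiply the
counts further, which is how a factor of about 12 could arise.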

Follow-Ups:
  Re: On the word length distribution
    From: John Grove

References:
  RE: On the word length distribution
    From: Rene Zandbergen
