Is the VMs Undecipherable?



	I have some thoughts on the "unicity distance" concept
several people mentioned to me.  This is the minimum
length of ciphertext, in a given language/orthography/
cipher system, at which only one plaintext remains
plausible.  We currently know of no way to prove that
the VMs is gibberish or that there is insufficient
material for decipherment.  A modified "unicity
distance" might give us a way to do both.

	At first glance it seemed unlikely.  One link noted
that the unicity distance is usually quite short,
rarely exceeding 60-70 characters, since natural
languages are highly redundant.  However, consider this
example:

JIWEN LIU's Home Page
http://www2.ics.hawaii.edu/~jiwen/ics623/ics623.html

	Step 1) Suppose that the unicity distance = n characters.
	Step 2) H = 1.5 bits per letter.
		Hence the number of messages that might be accepted:
		2^(n*H)
	Step 3) Suppose that the plain English cleartext uses
		lowercase characters 'a'~'z' plus blank: 27 symbols.
		Hence the number of possible messages: 27^n
	Step 4) The number of keys: 27!
	Step 5) From the formula:
		1 = 27! * 2^(1.5*n) / 27^n
		log(27!) + 1.5*n*log(2) - n*log(27) = 0
		n = 28.61
	Step 6) The unicity distance = 28.61 characters
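
	To check the arithmetic in Steps 1-6, here is a minimal
Python sketch.  It simply solves the Step 5 equation for n,
giving n = log2(27!) / (log2(27) - H).  The function name
and parameter names are my own, not from the linked page,
and a key count of 27! corresponds to a simple substitution
over the 27 symbols, which the page appears to assume.

```python
import math


def unicity_distance(alphabet_size: int, entropy_bits_per_char: float) -> float:
    """Solve 1 = K * 2^(n*H) / A^n for n, with K = A! keys.

    Assumption: the key space is every permutation of the
    alphabet (simple substitution), as in Step 4.
    """
    # Bits needed to name one key out of alphabet_size! keys.
    key_bits = math.log2(math.factorial(alphabet_size))
    # Redundancy per character: maximum possible entropy minus actual entropy.
    redundancy_bits = math.log2(alphabet_size) - entropy_bits_per_char
    if redundancy_bits <= 0:
        raise ValueError("no redundancy: the unicity distance is unbounded")
    return key_bits / redundancy_bits


if __name__ == "__main__":
    # 27 symbols ('a'-'z' plus blank), H = 1.5 bits per letter.
    print(f"{unicity_distance(27, 1.5):.2f}")
```

	Note that everything language-specific enters through
the single number H.  Raising H shrinks the redundancy
term and lengthens the unicity distance.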

	This treatment of unicity distance seems to say that
you could have any distribution of characters within a
token, which is definitely not true of natural
languages!  Natural languages have constraints within
words/tokens such as:

1)  Vowel/consonant alternation.

2)  Syllabic constraints.  Japanese is an extreme
example.  In Japanese a syllable may have:
	a) Zero or one of 17 consonants,
	b) One of 10 vowels (remember long/short), and
	c) One of 
		i)   Zero phonemes,
		ii)  -n, or
		iii) the double of a stop (at most one of 6)
		     beginning the next syllable.

3)  Permitted but not seen.  Even in a grid like one
might draw for Japanese kana (syllables), not all
permitted combinations actually occur.  The same is
true of the various proposed Voynichese word paradigms.

4)  Other types of phonotactic and graphotactic
constraints.  In English word-initial #sht- is
impossible but in High German it is common.  The same
"permitted but not seen" caveat applies here too.

5) Rules of word formation.   There are derivational
and inflectional prefixes, infixes, and suffixes. 
Compound words are formed from smaller ones.  
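
	To give a feel for how much a constraint like item 2
narrows things down, here is a rough Python sketch that
counts the syllables the Japanese grid permits.  It takes
the inventory sizes in item 2 at face value; the constant
and function names are my own, and the result is only an
upper bound, since item 3 ("permitted but not seen")
removes further combinations.

```python
import math

# Inventory sizes taken at face value from item 2 (assumptions, not measured data).
ONSET_CHOICES = 17 + 1     # one of 17 consonants, or no consonant
VOWEL_CHOICES = 10         # long and short vowels counted separately
CODA_CHOICES = 1 + 1 + 6   # nothing, -n, or the double of one of 6 stops


def permitted_syllables(onsets: int = ONSET_CHOICES,
                        vowels: int = VOWEL_CHOICES,
                        codas: int = CODA_CHOICES) -> int:
    """Count syllables allowed by an onset x vowel x coda grid."""
    return onsets * vowels * codas


def max_bits_per_syllable(syllable_count: int) -> float:
    """Upper bound on information per syllable if all were equally likely."""
    return math.log2(syllable_count)


if __name__ == "__main__":
    total = permitted_syllables()
    print(total, round(max_bits_per_syllable(total), 2))
```

	With these numbers the grid allows 18 * 10 * 8 = 1440
syllables, or about 10.5 bits per syllable at most.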

	Beyond constraints within words, a phenomenon
extensively studied for Voynichese, rules of syntax
determine permissible word sequences in a given natural
language.  The definition of unicity distance takes
none of this into account.  If one wants to define *the
minimum text size needed for a single, unambiguous
decipherment*, somehow one must include all these
constraints.  

	Perhaps one could define a modified unicity distance
that means the minimum size of a text that has one
plaintext in a given language.  Since the word
structure of Voynichese has been extensively studied,
we could calculate a modified unicity distance for
Voynichese that does not include the effect of syntax. 
If we ever study the syntax of Voynichese, we could
include that too.  But I'd have to think about this a
lot more to have any idea of how to do it. 

	The floor is open, especially to those who know a lot
more about these matters than I do.

Dennis