
Chinese theory redux!

Hi folks, 

First the news: I have just added a new page on my VMS site
(the first one since dec/2000):


I was kept away from VMS hacking and flaming partly by the so-called
real work, partly because I decided to write a technical report on
Voynichese statistics and word structure, and finally --- I admit ---
because with the apparent death of the Chinese Theory I felt "lost in
the woods without my dog", as they say around here.

In fact I was a bit disappointed at the lack of reaction to both
discoveries.  But that was all right, because I just found out that
the last one was quite wrong (yes, again. Oh well...)

Actually, the fact itself still stands: the Voynichese word length
distribution is almost exactly like the binomial distribution
binom(9,k-1) (i.e. the number of distinct words with k letters 
is proportional to the chance of obtaining k-1 heads in 9 coin tosses),
and quite unlike that of Latin, English, or any other "ordinary"
language.
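
For reference, the shape of binom(9,k-1) can be tabulated in a few lines.
This is only a sketch of the model curve; the function name and the
normalization to fractions are assumptions made for illustration, and no
actual VMS word counts are used here.

```python
from math import comb

N = 9  # number of coin tosses in the binom(9, k-1) model


def expected_fraction(k: int) -> float:
    """Fraction of distinct words expected to have k letters under binom(9, k-1)."""
    if not 1 <= k <= N + 1:
        return 0.0
    return comb(N, k - 1) / 2 ** N


if __name__ == "__main__":
    # Word lengths 1..10 correspond to 0..9 heads in 9 tosses.
    for k in range(1, N + 2):
        print(f"{k:2d} letters: {expected_fraction(k):.4f}")
```

The curve is symmetric about k = 5.5, so lengths 5 and 6 are equally
likely under the model, as are lengths 1 and 10.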

My mistake was to assume, without checking, that such symmetry ruled
out *any* natural language, and could only be the result of some
codebook-based system. But yesterday, while preparing some plots for
that technical report, I found that I was flatly wrong.  Thus the 
Chinese Theory is not only still alive, but in fact has been 
further strengthened by that binomial stuff. Please see the details
at the site...

  - - -

By the way, the discussion on the "true" Voynichese alphabet is both
exciting and frustrating; if only I had the time to reply to all posters. I
must have wasted months of brain and computer time, literally, trying
to answer that question. That work told me a lot about the VMS word
structure; yet I still cannot make up my mind on whether "aiin" is one
letter, six letters, or anything in between.

In fact, what little progress I made on that question came not through
studying the VMS itself, but by looking very closely at other natural
languages and scripts. Namely, I realized that the approach I have been
using through all these years is flawed in principle and is almost
guaranteed *not* to work. 

I am referring here to statistical analysis of characters, character pairs,
n-grams, and other character patterns like that.  These cover the
spectrum from simplistic arguments like "`q' and `u' always
appear together, so `qu' must be a single letter", through Sukhotin's
algorithm, H_k analysis, and beyond.  
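
For readers who have not met it, Sukhotin's algorithm can be sketched
roughly as below. This follows the usual textbook description (symmetric
adjacency counts, greedy promotion of the highest-scoring symbol to
"vowel"); the function name and the alphabetical tie-breaking are
assumptions of this sketch, not part of any particular implementation.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the set of symbols that Sukhotin's algorithm labels as vowels."""
    # Symmetric adjacency counts; pairs of identical symbols are ignored.
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts as a consonant, scored by its total adjacency count.
    score = {s: sum(adj[s].values()) for s in symbols}
    vowels = set()
    while True:
        # Sorted so that ties are broken alphabetically (an arbitrary choice).
        consonants = sorted(s for s in symbols if s not in vowels)
        if not consonants:
            break
        best = max(consonants, key=lambda s: score[s])
        if score[best] <= 0:
            break
        vowels.add(best)
        # Neighbours of the new vowel become less vowel-like.
        for s in consonants:
            if s != best:
                score[s] -= 2 * adj[s][best]
    return vowels


if __name__ == "__main__":
    print(sukhotin_vowels(["banana"]))
```

Note that the procedure sees nothing but counts of adjacent pairs, which
is exactly the kind of input discussed here.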

The flaw comes from the fact that natural languages and their scripts
naturally evolve towards greater efficiency; and an absolutely
efficient language would look like a zipped file --- a stream of
random independent bytes. So, by analyzing n-gram frequencies and
such, we are not studying the useful contents of the text, but rather
the defects in its delivery medium --- like a "cheese analyzer" that
in fact can only measure its holes.
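
The "zipped file" point can be illustrated with a rough experiment:
measure the single-byte entropy of a redundant text before and after
compression. The sketch below uses Python's zlib and a made-up toy
"language" (random draws from a small English vocabulary); the
vocabulary, the seed and the function names are all assumptions chosen
for illustration, and single-byte entropy is only a crude stand-in for
real n-gram statistics.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the single-byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def sample_text(n_words: int = 5000, seed: int = 1) -> bytes:
    """Toy text: words drawn at random from a small hypothetical vocabulary."""
    vocab = ["the", "cheese", "pizza", "oven", "holes", "dog", "woods",
             "letter", "script", "numeral", "word", "language", "coin",
             "toss", "martian", "dinner"]
    rng = random.Random(seed)
    return " ".join(rng.choices(vocab, k=n_words)).encode("ascii")


def compare(n_words: int = 5000, seed: int = 1):
    """Return (entropy of raw text, entropy of zlib-compressed text)."""
    raw = sample_text(n_words, seed)
    packed = zlib.compress(raw, 9)
    return byte_entropy(raw), byte_entropy(packed)


if __name__ == "__main__":
    h_raw, h_packed = compare()
    print(f"raw text:   {h_raw:.2f} bits/byte")
    print(f"compressed: {h_packed:.2f} bits/byte")
```

The compressed stream carries the same message, yet its byte statistics
are much flatter: most of the patterns a frequency count could latch
onto were redundancy, and the compressor removed them.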

Moreover, since those defects are not important to the persons involved, they
are easily lost --- like cheese holes in a pizza oven --- whenever the
message is recast in another medium, such as a newly invented script.
They are also easily masked or scrambled by change 

Finally, those meaningful language structures that do have statistically
visible effects generally have fairly complex rules and scopes,
extending over several characters or words.  Now, when the words are
chopped up into n-grams for analysis, the effects of those structural rules
get chopped up and mixed together too, so that any pattern that can be
seen in the n-gram counts is probably the sum of the indirect
effects of several structural rules.

To see what I mean, consider Roman numerals. Compared to natural
languages, they have an extremely simple structure, which manifests
itself as strong constraints on the number and order of its symbols.
The Arabic-Roman conversion algorithm can be precisely described in a
couple of sentences, and anyone who learns it will automatically know
how to parse a numeral. But now imagine a Martian who got hold of a few
thousand Roman numbers, in random order, and is trying to discover the
rules by statistical analysis of their n-grams and such. Can you see a
path that could take him from those tabulations to the rules for
parsing numbers like XCIX, CXC, CXV, or LXV? Can he use such data to
identify the "true" Roman-number alphabet --- whatever that is?
Sukhotin's algorithm will surely suggest interesting vowel/consonant
splits: will that help him, or only confuse him more?
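
To make the Martian's predicament concrete, here is a minimal sketch
that puts the generating rule and the Martian's view of it side by side.
It assumes the standard modern subtractive convention (IV, IX, XL, XC,
CD, CM) and numbers from 1 to 3999; the function names are made up for
this illustration.

```python
from collections import Counter

# Value/symbol pairs in descending order, including the subtractive forms.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to a Roman numeral."""
    if not 0 < n < 4000:
        raise ValueError("n must be between 1 and 3999")
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def bigram_counts(words):
    """Count adjacent symbol pairs within each word: the Martian's raw data."""
    counts = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            counts[a + b] += 1
    return counts


if __name__ == "__main__":
    numerals = [to_roman(n) for n in range(1, 4000)]
    for pair, count in bigram_counts(numerals).most_common(10):
        print(pair, count)
```

The whole rule fits in one small table and a loop, but the bigram table
that comes out the other end mixes the additive and subtractive uses of
each symbol (the "XC" of XCIX and the "XC" inside CXC are counted
together), so the rule is not easy to read back from the counts.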

Putting it all together, it would seem that I have spent a significant slice
of my life trying to answer the question `Holstein or Yorkshire' by
studying the shape of the holes in the grated cheese on my pizza...

  ( Oops, that analogy reminds me that it is well past dinner time. 
    Enough for today; thanks for reading this far.  
    All the best,
     --stolfi 8-)