Re: WG: average word length in VMS

To: voynich@xxxxxxxx
Subject: Re: WG: average word length in VMS
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Tue, 26 Sep 2000 23:09:59 -0300 (EST)
Delivered-to: reeds@research.att.com
In-reply-to: <39D0B94D.55710F91@alphalink.com.au>
References: <BPEOIKLPOIDECCHIOEMCGEEGCAAA.Claus_Anders@t-online.de> <39D02EA1.6011D2B3@mail.msen.com> <39D09ABF.DF482B07@gte.net> <39D0B94D.55710F91@alphalink.com.au>
Reply-to: stolfi@xxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx
    > Er... phonosyntactic oddity? You mean the way in which the
    > letters or groups of letters presumably representing sounds
    > combine together? Jorge Stolfi has done that and he has come up
    > with something which looks very much like Chinese -- the
    > infamous "Chinese hypothesis". It sure does look the spit and
    > image of Chinese to me.
    
Thanks, Jacques, for the lead.  Since you have all been bored to death
by my Chinese elucubrations, I guess that one more message can't make
things much worse.

I'll assume that you have read my pages about the
core-mantle-crust (KMC) word structure model,
http://www.dcc.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/

Gabriel and others have wondered whether similar structures could be
found in natural languages like Latin or English. I don't know how to
perform this "control experiment" meaningfully, since I didn't use any
algorithm to extract the structure.  I just kept poring over the
statistics and tweaking the grammar until I was satisfied with the
result (a simple enough grammar with an accurate enough match to the
observed vocabulary).

However, I will be very surprised if any KMC-like structure is ever
found in Latin or English words. The key features of the KMC model are
that there is only one gallows per word at most, that the "chairs"
always occur next to the gallows (if there is one), and that the other
letters occur mostly at the end of the word, after the gallows and
tables. (There are other non-trivial rules governing the placement of
<q> <e> <a> <o> <y> <ii> <m> etc. --- my earlier "OKOKOKO" model ---
but these rules seem to have a very local scope, and could be just
consequences of the mapping from the "true" Voynichese alphabet to
EVA.)

I can't see how the key features above could be found in Latin, under
any simple encoding. As far as I know, there is no letter that is
constrained to occur at most once in each Latin word (and yet occurs
in every other word!). More generally, there seems to be no
tripartition of the alphabet into core, mantle, and crust subsets,
with positions nested as specified by the KMC model.
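
The "at most once per word, yet in every other word" test is easy to
run on any word list. Here is a minimal sketch; the function name
letter_profile is made up for illustration, and the sample is the
Latin passage quoted further down, with the syllables rejoined into
words by hand (an assumption about where the word breaks fall).

```python
from collections import Counter

def letter_profile(words):
    """For each letter, return a pair:
    (fraction of words containing it,
     fraction of those words where it occurs more than once)."""
    present = Counter()
    repeated = Counter()
    for w in words:
        for ch, n in Counter(w).items():
            present[ch] += 1
            if n > 1:
                repeated[ch] += 1
    total = len(words)
    return {ch: (present[ch] / total, repeated[ch] / present[ch])
            for ch in present}

# Latin sample, syllables rejoined into words by hand (assumption).
sample = ("in principio creavit deus caelum et terram terra autem erat "
          "inanis et vacua et tenebrae super faciem abyssi").split()

# List letters from most to least widespread.
for ch, (coverage, repeat_rate) in sorted(
        letter_profile(sample).items(), key=lambda kv: -kv[1][0]):
    print(f"{ch}: in {coverage:.0%} of words, repeated in {repeat_rate:.0%} of those")
```

A gallows-like letter would show high coverage together with a repeat
rate near zero. The same function could be applied to a class of
letters by first mapping every member of the class to one symbol.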

On the other hand, Latin has an obvious partition of the alphabet into
two classes V and C --- the vowels and consonants --- that tend to
alternate within the word. Indeed, this bipartition is so obvious that
it can be identified by Sukhotin's algorithm, working only with the
digraph frequencies.

If I understood Jacques's description correctly, Sukhotin's algorithm
looks for two subsets C and V of the alphabet that maximize the
frequency of CV and VC transitions in the words. I seem to recall that
Sukhotin's algorithm applied to Voynichese produced only a few
unconvincing results that led nowhere (probably echoes of the OKOKOKO
model). Part of the problem may have been the multiletter
Voynichese->EVA encoding, which tends to obscure the C-V alternations.
But even if we were to use the true Voynichese alphabet, the KMC
structure would probably prevent the algorithm from finding
enough CV and VC transitions to write home about.
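
For reference, here is a minimal sketch of the greedy procedure that
usually goes under Sukhotin's name, written from the common textbook
description and not from Jacques's message, so the details are an
assumption. Every letter starts as a consonant; the letter with the
largest count of adjacencies is promoted to vowel; the counts of the
remaining letters are then reduced by twice their adjacencies with the
new vowel; and so on until no letter has a positive count.

```python
from collections import defaultdict

def sukhotin(words):
    """Greedy vowel/consonant split from adjacent-letter counts.
    Returns (vowels, consonants) as two sets of letters."""
    adj = defaultdict(lambda: defaultdict(int))
    letters = set()
    for w in words:
        letters.update(w)
        for a, b in zip(w, w[1:]):
            if a != b:  # doubled letters are ignored (zero diagonal)
                adj[a][b] += 1
                adj[b][a] += 1
    sums = {c: sum(adj[c].values()) for c in letters}
    vowels = set()
    while True:
        candidates = {c: s for c, s in sums.items() if c not in vowels}
        if not candidates:
            break
        # Alphabetical order makes the tie-break deterministic.
        best = max(sorted(candidates), key=candidates.get)
        if candidates[best] <= 0:
            break
        vowels.add(best)
        for c in candidates:
            if c != best:
                sums[c] -= 2 * adj[c][best]
    return vowels, letters - vowels

vowels, consonants = sukhotin(["banana"])
print(sorted(vowels), sorted(consonants))  # ['a'] ['b', 'n']
```

The input is just a list of words over whatever alphabet one chooses,
so the same sketch can be fed EVA words or any re-encoding of them.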

On the other hand, the KMC structure is not unlike the structure of
single *syllables* in Latin and other natural languages. Syllable
boundaries are partly a matter of convention; but, off the top of my head, I
would guess that the Latin syllable can be said to have the general
structure SCRVVN where all letters are optional except for one V; and
S, R, and N are specific subsets of the consonants:

  in prin ci pio cre a vit de us cae lum et te rram 
  te rra au tem e rat i na nis et va cu a et te ne brae 
  su per fa ci em a by ssi ...
  
So it is tempting to identify the core letters K (gallows) of the KMC
model with the main consonant C of the syllable; the mantle letters M (chairs)
with the secondary consonants S and R; the crust letters C (dealers) with the
vowels; and the final groups <iin>, <in>, etc. with the final consonants N.

This theory has some strengths; for instance it seems to fit the
observation that dealers (=vowels) often occur alone, whereas gallows,
tables, and finals (=consonants) almost never do. It also can be
stretched to fit the existence of crust prefixes (=vowels before the
main consonant), which are present only rarely, and almost never have
more than one dealer: they could be lone unstressed vowels that the
author may plausibly have felt belonged to the next syllable
("ina-nis  aby-ssi" instead of "i-na-nis a-by-ssi").

Unfortunately I haven't been able to take this idea very far. That
doesn't mean much since I didn't really try, and anyway the general
"syllabic Latin" theory still leaves many knobs to be set.

One problem that gets me stumped is the circles <aoy> ---
I can't see what features of the Latin syllable could correspond 
to them.  Also there is a non-negligible number of words 
with two chairs after the gallows. Also, in the Latin syllable
the sets S, R, and N are non-disjoint, and are subsets of C; 
whereas gallows, chairs, dealers and finals are disjoint. 

Enter the Chinese hypothesis:

Some of these difficulties get resolved (but others get created) if we
assume that Voynichese is an East Asian language such as Chinese (any
dialect), Vietnamese, Khmer, Burmese, Tibetan, etc.

In particular, the modern Mandarin syllable has the structure CYVVN
where all parts are optional, Y is a glide and N is a final consonant
("n", "ng", rarely "r"):

  lu3 xun4 shi4 jin4 dai4 shi3 shang4 zui4 you3 ying3 xiang3 li4 de wen2
  xue2 jia1 gen1 pi1 ping2 jia1 zhi1 yi1 yi1 ba1 ba1 yi1 nian2 chu1
  sheng1 zai4 zhe4 jiang1 shao4 xing1 yi2 ge xiang1 dang1 fu4 yu4 de
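
The CYVVN reading of the sample above can be sketched as a regular
expression. This is a rough illustration only: the name
split_syllable is made up, the list of initials and the rule for
spotting a glide (a written y or w, or an i or u standing before
another vowel) are simplifying assumptions, and the u-umlaut vowel is
not handled separately.

```python
import re

# Simplified CYVVN pattern for toneless or digit-toned pinyin syllables.
SYLLABLE = re.compile(
    r"(zh|ch|sh|[bpmfdtnlgkhjqxrzcs])?"  # C: initial consonant
    r"(y|w|[iu](?=[aeiou]))?"            # Y: glide (y/w, or i/u before a vowel)
    r"([aeiou]+)"                        # VV: one or more vowel letters
    r"(ng|n|r)?"                         # N: final consonant
    r"([1-4])?"                          # tone digit, absent on toneless syllables
)

def split_syllable(s):
    """Return (C, Y, VV, N, tone) as strings, empty where a part is
    missing, or None if the syllable does not fit the pattern."""
    m = SYLLABLE.fullmatch(s)
    return m.groups(default="") if m else None

sample = """
lu3 xun4 shi4 jin4 dai4 shi3 shang4 zui4 you3 ying3 xiang3 li4 de wen2
xue2 jia1 gen1 pi1 ping2 jia1 zhi1 yi1 yi1 ba1 ba1 yi1 nian2 chu1
sheng1 zai4 zhe4 jiang1 shao4 xing1 yi2 ge xiang1 dang1 fu4 yu4 de
""".split()

for syl in sample[:6]:
    print(syl, split_syllable(syl))
```

Tabulating the five fields over a longer text would give the sizes of
the C, Y, VV and N inventories, which is the kind of count one would
want to set against the gallows, chairs, dealers and final groups.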

It is tempting to conjecture that C or CY corresponds to the gallows
and chairs (core+mantle), YVV or VV to the suffix dealers, and N to
the final groups. In Chinese too there are unstressed single-vowel
syllables that the author may have chosen to attach to the following
syllable, thus explaining the occasional crust prefix. Unlike Latin,
the three components consist of (mostly) disjoint sets of sounds.

Moreover the Chinese syllable has a tone (one of four pitch patterns),
which is just as important as the consonant. The tone is denoted by a
diacritic or a digit superscript in the modern phonetic script pinyin
(see sample above). Matteo Ricci's spelling system (Macao, ~1585)
apparently used a combination of diacritics, similar to (but simpler
than) that of modern Vietnamese (itself a Jesuit design, ~1620 or
earlier). Between these two there flourished a few other systems,
invented by diacritophobic anglophones, which used dummy consonants or
omitted tones altogether.

You can denote tone with diacritics, superscripts and dummy consonants
only after you have learned how many significantly different tones
there are. If you are still trying to learn the language by ear,
without the benefit of a textbook (or if you are a linguist comparing
languages with different tone systems), you would probably use a
verbose encoding of tone, based on pitch levels instead of pitch
patterns. Thus you may write the Mandarin syllable "ma" in the third
tone as m2a1a3 or m2a13 or 2m1a3, "213" indicating a "mid-low-high"
pitch pattern. I am grasping at the conjecture that the <aoy> serve
precisely such a purpose; one point in favor is that they are inserted
chiefly either at the beginning of the syllable, or in the crust
(vowel) suffix; and that is where pitch marks seem to belong.
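
To make the three spellings of "ma" concrete, here is a small sketch
that generates them from a string of pitch levels. The function name
and the three layout rules are invented for illustration; only the
"213" contour and the three resulting spellings come from the
paragraph above.

```python
def verbose_tone_spellings(initial, vowel, levels):
    """Return three ad-hoc spellings of a syllable with explicit pitch
    levels. `levels` is a string of digits (1 = low, 2 = mid, 3 = high)
    and must contain at least two digits."""
    if len(levels) < 2:
        raise ValueError("need at least two pitch levels")
    first, rest = levels[0], levels[1:]
    # Pitch after the consonant, then the vowel repeated once per pitch.
    spread = initial + first + "".join(vowel + d for d in rest)
    # Pitch after the consonant, remaining pitches after a single vowel.
    compact = initial + first + vowel + rest
    # First pitch moved in front of the whole syllable.
    leading = first + initial + rest[0] + vowel + rest[1:]
    return spread, compact, leading

print(verbose_tone_spellings("m", "a", "213"))
# ('m2a1a3', 'm2a13', '2m1a3')
```

In all three layouts the pitch marks land either at the very start of
the syllable or among the vowels, which is the placement pattern
claimed for the circles.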

There are other arguments for Chinese, which I have posted before and
will not repeat here. Let me just observe that, in all the relevant
East Asian languages, the syllable is indeed a unit of meaning; so we
wouldn't have to explain why the author chose to separate syllables
instead of words --- and why most labels seem to be single syllables.

There are problems with Mandarin, though. The number of Mandarin
consonants is much smaller than the number of core+mantle patterns that
occur in the VMS. Moreover, it seems a bit unlikely that the author
would have used combinations like <ckheshe> to denote a single
consonant. (However, the Gwoyeu Romatzyh spelling used in Taiwan is
almost that bad. Or perhaps the author was German... 8-) Also Mandarin
has only one or two finals, whereas Voynichese has three or four
common ones, and a few rare ones.

Of course, even if the language is East Asian, it is certainly not
modern Mandarin. At most it would be Mandarin as spoken in the 1500's.
From a small sample of Ricci's notation (reproduced on the cover of
Jonathan Spence's book), it looks like the syllable structure was
basically the same; but I see five diacritics, so perhaps there were
more than 4 tones back then. I also see a "chum", so there may have
been more final sounds too.

However, the first European visitors to China in modern times were for
a long time confined to Macao, in the Cantonese-speaking region.
Cantonese has eight tones and a richer set of finals (-k, -t, -p,
etc.)  Those may fit the number of final groups in Voynichese.
However the limited number of consonants is still a problem.

Then there are other languages in the region. Linguists nowadays say
that Vietnamese and Chinese are unrelated, but that is one part of
reality that I stubbornly refuse to acknowledge. Not that it matters,
though: Vietnamese is syllable based, has six tones, and more or less
the same syllable structure as Mandarin; so it could do as well. In
fact it has more finals and vowel combinations, and even consonant
clusters like "tr" and "kr". Ditto, in varying degrees, for Thai,
Khmer, Burmese, French 8-), and a few other languages spoken in
Southeast Asia.

It is known that the Portuguese arrived in that region around 1510, a
few decades before landing in Macao; that earlier date could take some
strain away from the chronology. Unfortunately I could not find any
information about those early contacts.

I have been told that Manchu and Mongolian, although they are
unrelated to Chinese, may fit the bill too. But the best candidate
outside China may be Tibetan. It is a syllabic language with a
rudimentary tone system, consonant clusters, and a modest set of final
sounds. It has a native script, derived from some Indian model, which
is alphabetic but is said to be extremely un-phonetic. Curiously, the
tones are denoted (inconsistently) by prefixing certain dummy
consonants to the syllable, like b and r in "'byung rtsis"
("astronomy"). (Linguists claim that those dummy consonants were
originally pronounced as they are written, and mutated into tones only
a few centuries ago. Now this is another part of reality that I refuse
to accept. Some linguists even claim that there is a remote dialect of
Tibetan where those consonants are still pronounced. It is amazing
what one can get natives to do in exchange for a few packets of
Marlboro.)

(According to one source, the Pleiades are called "sMen-du's" in Tibetan.
Can we match that to EVA <doaro>?)


    
    > 
    > 1. Assuming that it is Chinese, which variety of Chinese?
    >    There are dozens of varieties of Chinese, all really
    >    different languages, mutually unintelligible. Plus,
    >    four hundred years ago they were certainly rather
    >    different from what they are today.
    > 
    > 2. It is not necessarily Chinese. My pet "serious" theory
    >    (no tongue in cheek for once) is an extinct language
    >    isolate, just like Basque, or Etruscan, but of course
    >    totally unrelated to either, and which happened to
    >    have a phonological  structure reminiscent of Chinese.
    >    I am persuaded that there were hundreds of such languages
    >    in Europe alone once. In other words, that the linguistic
    >    picture was very much like Papua New Guinea now. If you
    >    are after secrecy, it is a much better "cipher" than
    >    anything available at the time. A "Navaho code", as it were.
    >