Re: Curious coincidence

To: "John Grove" <John@xxxxxxxxxxxx>
Subject: Re: Curious coincidence
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Sat, 10 Jun 2000 16:06:57 -0300 (EST)
Cc: voynich@xxxxxxxx
Delivered-to: reeds@research.att.com
In-reply-to: <000701bfd27a$c5276680$d8916395@outlander>
References: <200006100037.VAA22543@coruja.dcc.unicamp.br> <000701bfd27a$c5276680$d8916395@outlander>
Reply-to: stolfi@xxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx
    > [John Grove:] Coincidence, I think. For example, there are no
    > dkaiin or kdain words (I think) which should happen in a 50-50
    > context over the whole document.
    
But that is the point: the 50-50 split is remarkable precisely because
the distribution of words is *not at all* like that of random strings.
Indeed, by various measures (Rene, Gabriel, Mark...), the word
distribution shows strong variations at all scales --- section, page,
and paragraph. So it is remarkable that the distribution of the
"gallows bit" (which is one of the most striking features of
Voynichese words, both visually and statistically) averages
out so close to 50-50.

It may be just coincidence, as you say. But it is intriguing that the
"codebook" theory predicts this sort of thing. Namely, the
distribution of whole word-codes would be just as irregular and
variable as the distribution of words in the original language; but,
if the codebook itself is random, then the word codes in any text will
be equally split between odd and even, just like a string of ideal
coin tosses. In fact, the codebook doesn't need to be really random,
merely independent from the text: even an alphabetized dictionary
should give a 50-50 odd/even split.
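
The codebook argument above can be tried out numerically. What follows is only a sketch under stated assumptions: the vocabulary, the Zipf-like word weights, and the helper names `odd_fraction` and `random_codebook` are all invented for illustration, and nothing here is derived from the VMS itself. It draws a skewed "text", assigns code numbers independently of word frequency, and measures the share of odd codes.

```python
import random

def odd_fraction(tokens, codebook):
    """Fraction of tokens whose code number is odd."""
    return sum(codebook[w] % 2 for w in tokens) / len(tokens)

def random_codebook(vocab, rng):
    """Assign the codes 0..N-1 to the vocabulary in random order."""
    codes = list(range(len(vocab)))
    rng.shuffle(codes)
    return dict(zip(vocab, codes))

rng = random.Random(0)

# Hypothetical vocabulary with a strongly skewed (Zipf-like) usage:
# word number i is drawn with weight 1/(i+1).
vocab = [f"w{i}" for i in range(2000)]
weights = [1.0 / (i + 1) for i in range(len(vocab))]
tokens = rng.choices(vocab, weights=weights, k=5000)

# Same text, many codebooks that are independent of the text.
fractions = [odd_fraction(tokens, random_codebook(vocab, rng))
             for _ in range(200)]
mean_odd = sum(fractions) / len(fractions)

print(f"mean odd share over 200 codebooks: {mean_odd:.3f}")
print(f"range over codebooks: {min(fractions):.3f} .. {max(fractions):.3f}")
```

The average over codebooks sits near one half regardless of how uneven the word frequencies are. How close a single codebook comes depends on how much the text is dominated by a few very frequent words, which is why the sketch also prints the range.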

"Linguistic" explanations don't seem so promising. Most binary
linguistic attributes -- like voiced/unvoiced, front/back,
capital/lowercase, masc/fem, etc. -- can be directly mapped to
binary physiological or mental variables. Therefore the two choices of the
attribute will have different "costs" or meanings, and this 
condition usually implies different distributions.

For instance, a voiced consonant probably takes a trifle longer to
pronounce than an unvoiced one; it probably consumes more or less lung
air, involves more or less muscles, etc. Add to that all sorts of
spelling and grammar biases that may favor one variant over the other.
Therefore, it is not surprising that voiced/unvoiced pairs have 
very skewed frequencies in typical languages. Some quick counts:

  Portuguese:
    166 p   332 t   61 f
     77 b   347 d   91 v

  Italian:
     94 p   157 t   31 f
     17 b    89 d   51 v

  English:    
     62 p   306 t   64 f
     65 b   126 d   33 v

(From short texts in the standard orthographies --- which, for It. and
Port., is essentially phonetic, at least with regard to these letters.)
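
To make the skew easier to read, the counts above can be turned into the share of the unvoiced member of each pair. This is just a small sketch: the numbers are copied from the table, while the names `COUNTS` and `unvoiced_share` are made up here.

```python
# Letter counts copied from the table above:
# language -> (unvoiced, voiced) letter pair -> (unvoiced count, voiced count)
COUNTS = {
    "Portuguese": {"p/b": (166, 77), "t/d": (332, 347), "f/v": (61, 91)},
    "Italian":    {"p/b": (94, 17),  "t/d": (157, 89),  "f/v": (31, 51)},
    "English":    {"p/b": (62, 65),  "t/d": (306, 126), "f/v": (64, 33)},
}

def unvoiced_share(unvoiced, voiced):
    """Share of the unvoiced letter within one voiced/unvoiced pair."""
    return unvoiced / (unvoiced + voiced)

shares = {
    lang: {pair: unvoiced_share(u, v) for pair, (u, v) in pairs.items()}
    for lang, pairs in COUNTS.items()
}

for lang, pairs in shares.items():
    row = "  ".join(f"{pair} {share:.3f}" for pair, share in pairs.items())
    print(f"{lang:<11} {row}")
```

A share of 0.5 would be an even split; the further a pair sits from that, the more skewed it is.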
   
    > Also, some of those two-g words are labels where they seem most
    > likely to be one word.
    
That is disputable --- why can't they be two-word labels?

    > Although, what you might have found is that the gallows
    > represents a shaddah (Is that the right word?) for showing that
    > one of the characters in the word is doubled. Okay, so I'm a
    > little obsessed with the idea! Also, how does one count a
    > split-g word? Is it one or two-g?
    
I am using the majority-vote reading, so it depends on the transcribers.
I excluded words containing "invalid" characters, including weirdos and
characters where there was no majority consensus. So some of those
split-gallows words were probably not counted. 

In any case, they are too rare to register (a couple dozen cases,
perhaps?)
    
    > Out of those 50% that do have one-g, what is the varying
    > position of the gallow - First, second, third character of a
    > two,three, four letter word...
    
Good question; see the counts below.

Zero gallows letters:

    709 0.04083 x
   2764 0.15919 xx
   4964 0.28590 xxx
   4483 0.25819 xxxx
   2995 0.17249 xxxxx
    929 0.05350 xxxxxx
    422 0.02430 xxxxxxx
     86 0.00495 xxxxxxxx
      9 0.00052 xxxxxxxxx
      2 0.00012 xxxxxxxxxx

One gallows letter:

     26 0.00149 @

    262 0.01502 @x
     36 0.00206 x@

   1010 0.05792 @xx
    767 0.04398 x@x
     87 0.00499 xx@

    934 0.05356 @xxx
   2402 0.13774 x@xx
    877 0.05029 xx@x
     19 0.00109 xxx@

    683 0.03917 @xxxx
   2232 0.12799 x@xxx
   2207 0.12656 xx@xx
    137 0.00786 xxx@x
      3 0.00017 xxxx@

    283 0.01623 @xxxxx
   1391 0.07976 x@xxxx
   2068 0.11858 xx@xxx
    155 0.00889 xxx@xx
     33 0.00189 xxxx@x
      2 0.00011 xxxxx@

    156 0.00895 @xxxxxx
    209 0.01198 x@xxxxx
    910 0.05218 xx@xxxx
    112 0.00642 xxx@xxx
     25 0.00143 xxxx@xx
      5 0.00029 xxxxx@x

     58 0.00333 @xxxxxxx
    107 0.00614 x@xxxxxx
     80 0.00459 xx@xxxxx
     39 0.00224 xxx@xxxx
     13 0.00075 xxxx@xxx
      2 0.00011 xxxxx@xx
      1 0.00006 xxxxxx@x

     15 0.00086 @xxxxxxxx
     36 0.00206 x@xxxxxxx
     32 0.00183 xx@xxxxxx
      8 0.00046 xxx@xxxxx
      3 0.00017 xxxx@xxxx
      2 0.00011 xxxxx@xxx

      1 0.00006 x@xxxxxxxx
      4 0.00023 xx@xxxxxxx
      1 0.00006 xxxxxx@xxx

      3 0.00017 xx@xxxxxxxx
      2 0.00011 xxx@xxxxxxx

      1 0.00006 xxxx@xxxxxxx

Two or more gallows letters:

     25 0.07669 @xx@x
     25 0.07669 @xx@xx
     25 0.07669 x@xx@xx
     18 0.05521 @x@xx
     18 0.05521 @xx@xxx
     18 0.05521 x@xx@x
     17 0.05215 x@x@xx
     16 0.04908 @x@x
     13 0.03988 @xxx@x
     11 0.03374 xx@xx@x
     10 0.03067 @x@xxx
      9 0.02761 x@x@x
      9 0.02761 x@xx@xxx
      7 0.02147 x@xxx@xx
      7 0.02147 xx@x@x
      6 0.01840 @x@xxxx
      6 0.01840 @xxx@xx
      6 0.01840 x@x@xxx
      6 0.01840 xx@x@xx
      5 0.01534 @xx@xxxx
      5 0.01534 @xxx@
      5 0.01534 xx@xx@xx
      5 0.01534 xx@xx@xxx
      4 0.01227 @xxx@xxx
      3 0.00920 @xxxx@xx
      3 0.00920 x@x@xxxx
      3 0.00920 x@xxx@x
      3 0.00920 x@xxxxx@x
      3 0.00920 xx@x@xxx
      2 0.00613 @xxxx@xxxx
      2 0.00613 x@@xxxx
      2 0.00613 x@xx@xxxx
      2 0.00613 xx@xxx@xx
      2 0.00613 xxx@xx@x
      1 0.00307 @@x
      1 0.00307 @x@xx@xx
      1 0.00307 @xx@
      1 0.00307 @xx@x@x
      1 0.00307 @xxx@xxxx
      1 0.00307 @xxx@xxxxx
      1 0.00307 @xxxx@x
      1 0.00307 @xxxxxx@xxx
      1 0.00307 x@@x
      1 0.00307 x@xx@
      1 0.00307 x@xx@xx@xx
      1 0.00307 x@xxx@xxx
      1 0.00307 x@xxx@xxxx
      1 0.00307 x@xxxx@
      1 0.00307 xx@@x
      1 0.00307 xx@x@
      1 0.00307 xx@xx@
      1 0.00307 xx@xxxx@
      1 0.00307 xxx@xx@
      1 0.00307 xxx@xx@xxxx
      1 0.00307 xxxx@x@xx
      1 0.00307 xxxx@xx@
      1 0.00307 xxxx@xx@xxx
      1 0.00307 xxxx@xx@xxxx
      1 0.00307 xxxxx@xx@


Here each "@" is a gallows (simple or platformed), and each "x" is one
non-gallows letter (counting "ch", "sh", and "ee" as a single letter).
The input is the whole VMS text, including circular and radial lines,
minus labels and key-like sequences.
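
For anyone who wants to rebuild this kind of table, here is a rough sketch of the pattern mapping described above. It is not the script that produced the counts. It assumes EVA input in which the simple gallows are "t", "k", "p", "f" and the platformed ones are "cth", "ckh", "cph", "cfh"; the function names `gallows_pattern` and `pattern_table` are hypothetical, and the filtering of "invalid" characters and labels is left out.

```python
import re
from collections import Counter

# Group 1 matches a gallows (platformed first, then simple); the remaining
# alternatives match one non-gallows letter, with "ch", "sh" and "ee"
# taken as single letters. The order of alternatives matters.
_LETTER = re.compile(r"(c[tkpf]h|[tkpf])|ch|sh|ee|.")

def gallows_pattern(word):
    """Map an EVA word to its pattern: '@' per gallows, 'x' per other letter."""
    return "".join("@" if m.group(1) else "x" for m in _LETTER.finditer(word))

def pattern_table(words):
    """Return (count, fraction, pattern) rows, most frequent pattern first."""
    counts = Counter(gallows_pattern(w) for w in words)
    total = sum(counts.values())
    return [(n, n / total, pat) for pat, n in counts.most_common()]

# Tiny made-up word list, only to show the output format.
sample = ["choky", "otal", "ky", "ty", "qokeedy", "daiin"]
for n, frac, pat in pattern_table(sample):
    print(f"{n:7d} {frac:.5f} {pat}")
```

Changing the treatment of "ch", "sh" and "ee", or of the platformed gallows, only requires editing the regular expression.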

I can easily redo the table with different criteria, if you prefer.

However, I think that this is not the "right" question to ask. I am
convinced that the Voynichese "code" makes liberal use of compound
"letters", and/or optional pre- and post- letter modifiers --- like
the Roman number system, or typical spelling systems. So, if the
logical "position" of the gallows is an important parameter, almost
certainly it is *not* measured by counting EVA or Currier letters.

(I *must* finish that report...)
    
    > Come to think of it, are there any two character words with a
    > Gallows (ty, ky)?

Very few:

     12 0.00034 ot
      5 0.00014 ok
      4 0.00011 yk
      2 0.00006 op
      1 0.00003 lk
      1 0.00003 lt
      1 0.00003 of

     23 0.00065 ky
     17 0.00048 ty
      6 0.00017 ko
      2 0.00006 kl
      2 0.00006 py
      2 0.00006 to
      1 0.00003 ka
      1 0.00003 ke
      1 0.00003 tl

Again, the corpus is the text minus labels, and the fractions
are relative to the total token count (~34,000).

I bet that many (all?) of these short words are the result of
transcription errors --- "cho ky" for "choky", "ot al" for 
"otal", etc.