Re: Curious coincidence
> [John Grove:] Coincidence, I think. For example, there are no
> dkaiin or kdain words (I think) which should happen in a 50-50
> context over the whole document.
But that is the point: the 50-50 split is remarkable precisely because
the distribution of words is *not at all* like that of random strings.
Indeed, by various measures (Rene, Gabriel, Mark...), the word
distribution shows strong variations at all scales --- section, page,
and paragraph. So it is remarkable that the distribution of the
"gallows bit" (which is one of the most striking features of
Voynichese words, both visually and statistically) averages
out so close to 50-50.
It may be just coincidence, as you say. But it is intriguing that the
"codebook" theory predicts this sort of thing. Namely, the
distribution of whole word-codes would be just as irregular and
variable as the distribution of words in the original language; but,
if the codebook itself is random, then the word codes in any text will
be equally split between odd and even, just like a string of ideal
coin tosses. In fact, the codebook doesn't need to be really random,
merely independent from the text: even an alphabetized dictionary
should give a 50-50 odd/even split.
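The codebook argument can be checked with a toy simulation. This is only a sketch, not part of the original counts: the vocabulary size, the Zipf-like word weights, and the name `odd_fraction` are all assumptions made for illustration.

```python
import random

def odd_fraction(num_words=5000, num_tokens=20000, seed=0):
    """Fraction of tokens whose codebook number is odd, for one random codebook."""
    rng = random.Random(seed)
    # Hypothetical source language: word i has Zipf-like weight 1/(i+1),
    # so the word distribution itself is very far from uniform.
    weights = [1.0 / (rank + 1) for rank in range(num_words)]
    # Random codebook: word i gets number codes[i], independent of its frequency.
    codes = list(range(num_words))
    rng.shuffle(codes)
    # Draw a text and count the tokens whose code number is odd.
    tokens = rng.choices(range(num_words), weights=weights, k=num_tokens)
    odd = sum(codes[w] % 2 for w in tokens)
    return odd / num_tokens

# Repeat with 20 independent codebooks and texts.
fractions = [odd_fraction(seed=s) for s in range(20)]
mean_fraction = sum(fractions) / len(fractions)
print(f"mean odd fraction over 20 codebooks: {mean_fraction:.3f}")
print(f"range: {min(fractions):.3f} .. {max(fractions):.3f}")
```

In this toy model the split is 50-50 on average, but a single codebook can drift a few percent either way, because a handful of very frequent words carry much of the weight.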
"Linguistic" explanations don't seem so promising. Most binary
linguistic attributes -- like voiced/unvoiced, front/back,
capital/lowercase, masc/fem, etc. -- can be directly mapped to
binary physiological or mental variables. Therefore the two choices of the
attribute will have different "costs" or meanings, and this
condition usually implies different distributions.
For instance, a voiced consonant probably takes a trifle longer to
pronounce than an unvoiced one; it probably consumes more or less lung
air, involves more or fewer muscles, etc. Add to that all sorts of
spelling and grammar biases that may favor one variant over the other.
Therefore, it is not surprising that voiced/unvoiced pairs have
very skewed frequencies in typical languages. Some quick counts:
Portuguese:
  166 p   332 t    61 f
   77 b   347 d    91 v
Italian:
   94 p   157 t    31 f
   17 b    89 d    51 v
English:
   62 p   306 t    64 f
   65 b   126 d    33 v
(From short texts in the standard orthographies --- which, for It. and
Port., is essentially phonetic, at least with regard to these letters.)
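Counts like these are easy to reproduce on any text. The following is a minimal sketch, assuming plain letter counts in the standard orthography; the names `PAIRS` and `pair_counts` and the sample sentence are illustrative, not the script used for the figures above.

```python
from collections import Counter

# Unvoiced/voiced letter pairs, in the same order as the table above.
PAIRS = [("p", "b"), ("t", "d"), ("f", "v")]

def pair_counts(text):
    """Return (unvoiced, count, voiced, count) for each letter pair."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [(u, counts[u], v, counts[v]) for u, v in PAIRS]

sample = "The quick brown fox jumped over the lazy dog by the river bend."
for u, n_u, v, n_v in pair_counts(sample):
    print(f"{n_u:5d} {u}   {n_v:5d} {v}")
```

Counting letters rather than phonemes is only reasonable where the spelling is close to phonetic for these letters.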
> Also, some of those two-g words are labels where they seem most
> likely to be one word.
That is disputable --- why can't they be two-word labels?
> Although, what you might have found is that the gallows
> represents a shaddah (Is that the right word?) for showing that
> one of the characters in the word is doubled. Okay, so I'm a
> little obsessed with the idea! Also, how does one count a
> split-g word? Is it one or two-g?
I am using the majority-vote reading, so it depends on the transcribers.
I excluded words containing "invalid" characters, including weirdos and
characters where there was no majority consensus. So some of those
split-gallows words were probably not counted.
In any case, they are too rare to register (a couple dozen cases,
perhaps?)
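For concreteness, here is a rough sketch of what a majority-vote reading with exclusion of doubtful words could look like. It is not the actual procedure: it assumes the transcriptions are already aligned character by character, that "?" marks an invalid character, and that "." separates words; the function names are made up.

```python
from collections import Counter

def majority_reading(readings, invalid="?"):
    """Per-position majority vote over equal-length, pre-aligned readings."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    out = []
    for column in zip(*readings):
        char, votes = Counter(column).most_common(1)[0]
        # Keep the character only if a strict majority of transcribers agree.
        out.append(char if votes * 2 > len(column) else invalid)
    return "".join(out)

def valid_words(line, invalid="?", sep="."):
    """Split a line into words, dropping any word with an invalid character."""
    return [w for w in line.split(sep) if w and invalid not in w]

line = majority_reading(["okal.dar", "okal.dar", "otal.dam"])
print(line, valid_words(line))
```

With this scheme a word is dropped as soon as one of its characters lacks a majority, which is why rare forms such as split gallows may go uncounted.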
> Out of those 50% that do have one-g, what is the varying
> position of the gallow - First, second, third character of a
> two,three, four letter word...
Good question; see the counts below.
Zero gallows letters:
709 0.04083 x
2764 0.15919 xx
4964 0.28590 xxx
4483 0.25819 xxxx
2995 0.17249 xxxxx
929 0.05350 xxxxxx
422 0.02430 xxxxxxx
86 0.00495 xxxxxxxx
9 0.00052 xxxxxxxxx
2 0.00012 xxxxxxxxxx
One gallows letter:
26 0.00149 @
262 0.01502 @x
36 0.00206 x@
1010 0.05792 @xx
767 0.04398 x@x
87 0.00499 xx@
934 0.05356 @xxx
2402 0.13774 x@xx
877 0.05029 xx@x
19 0.00109 xxx@
683 0.03917 @xxxx
2232 0.12799 x@xxx
2207 0.12656 xx@xx
137 0.00786 xxx@x
3 0.00017 xxxx@
283 0.01623 @xxxxx
1391 0.07976 x@xxxx
2068 0.11858 xx@xxx
155 0.00889 xxx@xx
33 0.00189 xxxx@x
2 0.00011 xxxxx@
156 0.00895 @xxxxxx
209 0.01198 x@xxxxx
910 0.05218 xx@xxxx
112 0.00642 xxx@xxx
25 0.00143 xxxx@xx
5 0.00029 xxxxx@x
58 0.00333 @xxxxxxx
107 0.00614 x@xxxxxx
80 0.00459 xx@xxxxx
39 0.00224 xxx@xxxx
13 0.00075 xxxx@xxx
2 0.00011 xxxxx@xx
1 0.00006 xxxxxx@x
15 0.00086 @xxxxxxxx
36 0.00206 x@xxxxxxx
32 0.00183 xx@xxxxxx
8 0.00046 xxx@xxxxx
3 0.00017 xxxx@xxxx
2 0.00011 xxxxx@xxx
1 0.00006 x@xxxxxxxx
4 0.00023 xx@xxxxxxx
1 0.00006 xxxxxx@xxx
3 0.00017 xx@xxxxxxxx
2 0.00011 xxx@xxxxxxx
1 0.00006 xxxx@xxxxxxx
Two or more gallows letters:
25 0.07669 @xx@x
25 0.07669 @xx@xx
25 0.07669 x@xx@xx
18 0.05521 @x@xx
18 0.05521 @xx@xxx
18 0.05521 x@xx@x
17 0.05215 x@x@xx
16 0.04908 @x@x
13 0.03988 @xxx@x
11 0.03374 xx@xx@x
10 0.03067 @x@xxx
9 0.02761 x@x@x
9 0.02761 x@xx@xxx
7 0.02147 x@xxx@xx
7 0.02147 xx@x@x
6 0.01840 @x@xxxx
6 0.01840 @xxx@xx
6 0.01840 x@x@xxx
6 0.01840 xx@x@xx
5 0.01534 @xx@xxxx
5 0.01534 @xxx@
5 0.01534 xx@xx@xx
5 0.01534 xx@xx@xxx
4 0.01227 @xxx@xxx
3 0.00920 @xxxx@xx
3 0.00920 x@x@xxxx
3 0.00920 x@xxx@x
3 0.00920 x@xxxxx@x
3 0.00920 xx@x@xxx
2 0.00613 @xxxx@xxxx
2 0.00613 x@@xxxx
2 0.00613 x@xx@xxxx
2 0.00613 xx@xxx@xx
2 0.00613 xxx@xx@x
1 0.00307 @@x
1 0.00307 @x@xx@xx
1 0.00307 @xx@
1 0.00307 @xx@x@x
1 0.00307 @xxx@xxxx
1 0.00307 @xxx@xxxxx
1 0.00307 @xxxx@x
1 0.00307 @xxxxxx@xxx
1 0.00307 x@@x
1 0.00307 x@xx@
1 0.00307 x@xx@xx@xx
1 0.00307 x@xxx@xxx
1 0.00307 x@xxx@xxxx
1 0.00307 x@xxxx@
1 0.00307 xx@@x
1 0.00307 xx@x@
1 0.00307 xx@xx@
1 0.00307 xx@xxxx@
1 0.00307 xxx@xx@
1 0.00307 xxx@xx@xxxx
1 0.00307 xxxx@x@xx
1 0.00307 xxxx@xx@
1 0.00307 xxxx@xx@xxx
1 0.00307 xxxx@xx@xxxx
1 0.00307 xxxxx@xx@
Here each "@" is a gallows (simple or platformed), and each "x" is one
non-gallows letter (counting "ch", "sh", and "ee" as a single letter).
The input is the whole VMS text, including circular and radial lines,
minus labels and key-like sequences. The fractions in each table are
relative to that table's own total.
I can easily redo the table with different criteria, if you prefer.
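A table of this kind can be rebuilt along the following lines. This is a sketch under stated assumptions, not the script that produced the counts: it assumes EVA input, takes t, k, p, f as simple gallows and cth, ckh, cph, cfh as platformed ones, and treats "ch", "sh", and "ee" as single letters. The names `TOKEN`, `gallows_pattern`, and `pattern_table` are illustrative.

```python
import re
from collections import Counter

# Longest alternatives first: platformed gallows, then ch/sh/ee, then any single char.
TOKEN = re.compile(r"c[tkpf]h|ch|sh|ee|.")
GALLOWS = {"t", "k", "p", "f", "cth", "ckh", "cph", "cfh"}

def gallows_pattern(word):
    """Map a word to its pattern: '@' for a gallows, 'x' for any other letter."""
    return "".join("@" if tok in GALLOWS else "x" for tok in TOKEN.findall(word))

def pattern_table(words):
    """Return (count, fraction, pattern) rows, most frequent pattern first."""
    counts = Counter(gallows_pattern(w) for w in words)
    total = sum(counts.values())
    return [(n, n / total, pat) for pat, n in counts.most_common()]

for n, frac, pat in pattern_table(["qokeedy", "daiin", "chckhy", "otal", "okal"]):
    print(f"{n:6d} {frac:.5f} {pat}")
```

Here the fractions are relative to all words passed in; to split the output by number of gallows, group the rows by `pat.count("@")` first. How a run like "eee" is split is a choice of this sketch (it becomes "ee" plus "e").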
However, I think that this is not the "right" question to ask. I am
convinced that the Voynichese "code" makes liberal use of compound
"letters", and/or optional pre- and post- letter modifiers --- like
the Roman number system, or typical spelling systems. So, if the
logical "position" of the gallows is an important parameter, almost
certainly it is *not* measured by counting EVA or Currier letters.
(I *must* finish that report...)
> Come to think of it, are there any two character words with a
> Gallows (ty, ky)?
Very few:
12 0.00034 ot
5 0.00014 ok
4 0.00011 yk
2 0.00006 op
1 0.00003 lk
1 0.00003 lt
1 0.00003 of
23 0.00065 ky
17 0.00048 ty
6 0.00017 ko
2 0.00006 kl
2 0.00006 py
2 0.00006 to
1 0.00003 ka
1 0.00003 ke
1 0.00003 tl
Again, the corpus is the text minus labels, and the fractions
are relative to the total token count (~34,000).
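The same kind of count can be sketched as follows, assuming a list of EVA word tokens with labels already removed. The gallows set and the function name are assumptions made for illustration.

```python
from collections import Counter

GALLOWS = set("tkpf")  # EVA simple gallows; an assumption of this sketch

def two_letter_gallows_words(tokens):
    """Return (count, fraction_of_all_tokens, word) for 2-letter words with one gallows."""
    total = len(tokens)
    counts = Counter(
        w for w in tokens
        if len(w) == 2 and sum(ch in GALLOWS for ch in w) == 1
    )
    return [(n, n / total, w) for w, n in counts.most_common()]

tokens = ["ky", "ky", "ot", "daiin", "ol", "kt", "chol", "ty"]
for n, frac, w in two_letter_gallows_words(tokens):
    print(f"{n:4d} {frac:.5f} {w}")
```

The denominator is the total number of tokens, not just the two-letter ones, to match the fractions quoted above.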
I bet that many (all?) of these short words are the result of
transcription errors --- "cho ky" for "choky", "ot al" for
"otal", etc.