Yet more bean-counting: [aoy]
Introduction
------------
In previous notes about the structure of Voynichese words, I have
been ignoring the `circle' letters O = { a o y }. This note looks at the
distribution of the O-letters within the words.
The word paradigm
-----------------
As you may recall, my Voynichese word paradigm (ignoring circle
letters) has the form
   Q?1 D?1 X?2 M?1 X?2 R?2
where the notation A?n means from zero to n instances of A,
and
   Q = { q }
   M = { k t p f } (the `gallows'), possibly preceded by "I" or "c"
       and/or followed by "h" and/or "e"
   X = { ch sh ee } (the `benches'), possibly followed by one "e"
   R = D + F
   D = { d l r s x v } (the `dealers'), possibly preceded by "i"s
       and/or followed by "e"
   F = { n m g j } (the `finals'), also possibly preceded by "i"s
       and/or followed by "e"
I will use the term `element' to mean any of these letters with
the attached [iceh] modifiers.
The paradigm implies that a word has a three-layer structure, with a
`core' of gallows elements, a `mantle' of benches, and a `crust' of
dealers and finals. Any layer may be empty, but if present it must be
a contiguous substring of elements, adjacent to or surrounding the
deeper layers. In particular the paradigm forbids words with more than
one M-letter, or two X- or M-letters separated by an R letter.
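The paradigm can be checked mechanically once a word has been reduced to a string of element classes. The sketch below assumes a hypothetical one-letter encoding of the classes (Q, D, F, X, M) and treats R as "D or F"; it does not parse Voynichese text itself, and the function name is only illustrative.

```python
import re

# Element classes as single letters (an illustrative encoding, not from the note):
# Q = "q", D = dealer, F = final, X = bench, M = gallows.
# R in the paradigm stands for "D or F".
PARADIGM = re.compile(r"Q?D?X{0,2}M?X{0,2}[DF]{0,2}")

def fits_paradigm(classes):
    """True if a string of element classes matches Q?1 D?1 X?2 M?1 X?2 R?2."""
    return PARADIGM.fullmatch(classes) is not None

print(fits_paradigm("QXMXD"))  # True: q, bench, gallows, bench, dealer
print(fits_paradigm("MDM"))    # False: more than one M-letter
print(fits_paradigm("XDX"))    # False: two X-letters separated by an R-letter
```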
(Beware that my notation and nomenclature have been changing through
these notes. Sorry for the confusion, but these are *working*
notes...)
Circles are not doubled
-----------------------
Implicit in the paradigm is the rule that circle letters can only be
inserted before or after an element, not within it (e.g. not between
a "k" and its modifying "e"). Thus a word with N elements has N+1
`slots' where the circle letters could be inserted.
There are about 51972 O-letters in the sample text, and about 109672
possible O-slots between elements. These slots are occupied as follows:
    0 circle letters:  58207 (53.1%)
    1 circle letter:   50819 (46.3%)
    2 circle letters:    501 ( 0.4%)
    3 circle letters:      3 ( 0.0%)
There are no instances of 4 or more circles in a row, except for
the `primeval scream' atop one of the cosmo diagrams. (There are
also 142 anomalous inter-element strings, such as "oe". We will
ignore them for now.)
Note that there is a definite dislike for two or more O-letters in a
row. If there were no restriction on the interleaving of O's and
other elements, then 62% of the slots would be empty, 29% would
contain one circle, 7% (i.e. over 7000) would contain two circles,
1% (over 1000) would have three, 0.1% would have 4, and so on.
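The percentages just quoted agree with a Poisson model in which the 51972 circles fall independently into the 109672 slots. That reading is an assumption (the note does not name the model it used), and the function name below is only illustrative. A minimal sketch:

```python
import math

N_CIRCLES = 51972   # O-letters in the sample (from the text)
N_SLOTS = 109672    # possible O-slots (from the text)

def poisson_occupancy(k, mean):
    """Probability that a slot holds exactly k circles under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean = N_CIRCLES / N_SLOTS          # average circles per slot, about 0.47
expected = {k: poisson_occupancy(k, mean) for k in range(5)}

for k, p in expected.items():
    print(f"{k} circles: {100 * p:5.1f}% of slots, about {p * N_SLOTS:7.0f} slots")
```

Under this model the expected number of double-circle slots comes out above 7000, against the 501 actually observed.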
Distribution of circles in local context
----------------------------------------
The following table shows the occurrences of O-strings according to
the two adjacent elements. The letters { a y } have been mapped to
"o" to make the table shorter. Word boundaries are denoted "#" and
empty O-strings by "_".
               Inter-element string
         ---------------------------------
Context      _      o     oo    ooo  other
-------  -----  -----  -----  -----  -----
#*#      -N/A-    240     18      .      3
M*#        189   2896     32      1      3
X*#        110   4942     51      .      3
R*#      19274   7311     13      .      2
#*R       6402   4748    174      1     33
R*R        838   6592     34      .      6
X*R       5086   4749     69      .      2
M*R       1402   7058     64      1      6
#*X       8899    577      9      .      1
R*X       1755     53      .      .      .
X*X       1294     37      .      .      .
M*X       6186     95      .      .      1
#*M       3635  10212     25      .     40
R*M       1237    159      5      .      7
X*M       1633    894      5      .     29
M*M         11    114      .      .      6
other      166    142      2      .      .
TOTAL    58207  50819    501      3    142
(The "other" counts are letter groups such as "oe", "shh", "ich",
detached [ice], etc. which cannot be parsed into the standard set of
elements.)
Note again that, overall, roughly half of the O-slots are empty, and
roughly half are occupied by a single "o". If the placement of the
"o"s were independent of the context, we should expect to see about
the same 1:1 ratio between the first two numbers in each row. We see
instead that the contexts M*X, X*X, R*X, #*X strongly repel O-strings
(ratios 65:1, 35:1, 33:1, 15:1, respectively), while X*#, M*#, and
R*R strongly attract them (ratios 1:45, 1:15, and 1:8, respectively).
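The ratios above are simply the first column of the table divided by the second. A small sketch that recomputes them and ranks the contexts from most repelling to most attracting (the counts are copied from the table; the function and variable names are illustrative):

```python
# (empty, single "o") counts per context, copied from the table above
COUNTS = {
    "M*#": (189, 2896), "X*#": (110, 4942), "R*#": (19274, 7311),
    "#*R": (6402, 4748), "R*R": (838, 6592), "X*R": (5086, 4749),
    "M*R": (1402, 7058), "#*X": (8899, 577), "R*X": (1755, 53),
    "X*X": (1294, 37), "M*X": (6186, 95), "#*M": (3635, 10212),
    "R*M": (1237, 159), "X*M": (1633, 894), "M*M": (11, 114),
}

def empty_to_single_ratio(context):
    """Ratio of empty slots to slots holding one circle, for one context."""
    empty, single = COUNTS[context]
    return empty / single

# Highest ratio first: contexts that repel circles come out on top.
ranked = sorted(COUNTS, key=empty_to_single_ratio, reverse=True)
for ctx in ranked:
    r = empty_to_single_ratio(ctx)
    label = f"{r:.0f}:1" if r >= 1 else f"1:{1 / r:.0f}"
    print(f"{ctx}  {label}")
```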
These numbers suggest that an O letter is either word-final, or a
modifier for the following R or M letter (but not X letter). Indeed,
of the 50819 instances of isolated "o", 49675 instances (97.7%) are
in one of these contexts. However, this cannot be taken as an axiom,
because, of the 18905 O-slots that are followed by an X element,
771 (4%) are filled --- a percentage which is too high to ignore. So
the truth must be more complicated than that.
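Both figures in the preceding paragraph can be recomputed from the context table. In the sketch below, the 49675 total is reproduced only if the isolated-"o" context #*# and the "other" row are left out; that exclusion is an inference from the numbers, not something stated in the note. Likewise the 18905 total omits the unparseable "other" strings.

```python
# Slots holding a single "o", by context (second column of the table)
SINGLE_O = {
    "#*#": 240, "M*#": 2896, "X*#": 4942, "R*#": 7311,
    "#*R": 4748, "R*R": 6592, "X*R": 4749, "M*R": 7058,
    "#*X": 577, "R*X": 53, "X*X": 37, "M*X": 95,
    "#*M": 10212, "R*M": 159, "X*M": 894, "M*M": 114,
    "other": 142,
}

# Word-final "o", or "o" before an R or M element (assumed to exclude #*#)
explained = sum(n for ctx, n in SINGLE_O.items()
                if ctx not in ("#*#", "other") and ctx[-1] in "#RM")
total_single = sum(SINGLE_O.values())
print(explained, total_single, f"{100 * explained / total_single:.1f}%")

# Slots followed by an X element: counts of (empty, o, oo, ooo)
BEFORE_X = {
    "#*X": (8899, 577, 9, 0), "R*X": (1755, 53, 0, 0),
    "X*X": (1294, 37, 0, 0), "M*X": (6186, 95, 0, 0),
}
x_slots = sum(sum(row) for row in BEFORE_X.values())
x_filled = sum(sum(row[1:]) for row in BEFORE_X.values())
print(x_filled, x_slots, f"{100 * x_filled / x_slots:.1f}%")
```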
Location of circles in the word paradigm
----------------------------------------
Let's say that a word is `hard' if it has a non-empty core and/or
mantle, and `soft' otherwise.
In a hard word we can isolate a maximal `prefix' and a maximal
`suffix' consisting of non-core, non-mantle letters --- namely,
dealers, finals, circles, and any [ie] modifiers. Thus, for example,
the hard word "orckhocheody" can be split into prefix "or", suffix
"ody", and core-mantle "ckhoche".
Note that a prefix, suffix, or soft word with N non-circle elements
has N+1 slots where circles could be inserted, while a core-mantle
with N non-circle elements has N-1 such slots. The following table
shows the counts of empty and occupied circle slots in soft words and
in the three parts of hard words.
    soft words:    22435 O-slots,   9952 occupied (44%)
    prefixes:      29078 O-slots,  12082 occupied (42%)
    suffixes:      46322 O-slots,  27572 occupied (60%)
    core-mantles:  11133 O-slots,   1534 occupied (14%)
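The split of a hard word into prefix, core-mantle, and suffix that underlies this table can be sketched with a regular expression. This is a simplification under stated assumptions: it recognises only gallows with an optional "c" before and "h"/"e" after, and benches with an optional trailing "e"; the "I" prefix mentioned in the paradigm and all anomalous groups are ignored. The names are hypothetical.

```python
import re

# Simplified core/mantle elements (an assumption, not the full paradigm):
# gallows = c?[ktpf]h?e?, bench = (ch|sh|ee) plus an optional "e".
HARD_ELEMENT = re.compile(r"c?[ktpf]h?e?|(?:ch|sh|ee)e?")

def split_hard_word(word):
    """Return (prefix, core_mantle, suffix), or None for a soft word."""
    matches = list(HARD_ELEMENT.finditer(word))
    if not matches:
        return None
    # Core-mantle runs from the first hard element to the last one.
    start, end = matches[0].start(), matches[-1].end()
    return word[:start], word[start:end], word[end:]

print(split_hard_word("orckhocheody"))  # ('or', 'ckhoche', 'ody')
print(split_hard_word("daiin"))         # None
```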
Thus we see that the O-letters strongly avoid the interior of
core-mantles. In fact, if we look closely, we find that most of the
filled O-slots in core-mantles are either combinations "Xo" that
precede the core, as in "chokedy" or "shchotchy", or occur in
`invalid' core-mantles (with more than one M, and/or with R intrusions).
Here are the numbers:
    valid core-mantles with O-slots:     9023
      without O-insertions:              8076 (89.5%)
      with "Xo" before core:              778 ( 8.6%)
      with "y" insertions:                 80 ( 0.8%)
      with other O-insertions:             89 ( 0.9%)
    invalid core-mantles with O-slots:    655
      without O-insertions:               109 (16.6%)
      with "Xo" before core:               94 (14.3%)
      with "y" intrusions:                 52 ( 7.9%)
      with other O-insertions:            400 (61.1%)
Note that "y" is almost always word-initial or word-final, so an
intra-word "y" is probably the result of omitted word space.
So the 89 valid core-mantles with other O-insertions may well be
due to the same cause.
Moreover, the enhanced frequency of "y" inside invalid core-mantles
suggests that these too are the result of joined words. So the
400 invalid core-mantles with other O-insertions are not significant.
In short, the circles are found mostly in the `crust' of words,
except for some 800 instances of "cho" and "sho" sequences in the
first half of the mantle.
Relationship between O- and R-letters
-------------------------------------
Let's look more closely at the interleaving of O and R letters in
the crust of words. That means about 8800 soft (crust-only) words,
as well as the prefixes and suffixes of about 26,000 hard words.
First, let's classify those strings according to the number of
R's and the number of O's:
SOFT WORDS
                                 O-letters in word
                      ----------------------------------------
R-letters in word         0      1      2      3      4  Total
--------------------  -----  -----  -----  -----  -----  -----
0 R-letters               -    240     18      .      .    258
1 R-letter              475   3000    387      5      .   3867
2 R-letters              62   3113    936     63      3   4177
3 R-letters               7     63    283     55      7    415
4 R-letters               1      4     24      6      1     36
5 R-letters               .      .      .      1      .      1
Total                   545   6420   1648    130     11   8754
Rel. percent           6.2%  73.3%  18.8%   1.5%   0.1% 100.0%
Abs. percent           1.6%  18.3%   4.7%   0.4%   0.0%  24.9%
Average number of R-letters: 1.56
Average number of O-letters: 1.15
PREFIXES
                             O-letters in prefix
                      ---------------------------------
R-letters in prefix       0      1      2      3  Total
--------------------  -----  -----  -----  -----  -----
0 R-letters           12534  10789     34      .  23357
1 R-letter             1546   1035     30      .   2611
2 R-letters              10    134     13      1    158
3 R-letters               1      .      1      .      2
Total                 14091  11958     78      1  26128
Rel. percent          53.9%  45.8%   0.3%   0.0% 100.0%
Abs. percent          40.4%  34.3%   0.2%   0.0%  74.9%
Average number of R-letters: 0.11
Average number of O-letters: 0.46
SUFFIXES
                                O-letters in suffix
                      ----------------------------------------
R-letters in suffix       0      1      2      3      4  Total
--------------------  -----  -----  -----  -----  -----  -----
0 R-letters             299   7838     83      1      .   8221
1 R-letter              641  13857   1377     10      1  15886
2 R-letters              29    853    894     69      2   1847
3 R-letters               5     10     73     29      2    119
4 R-letters               .      .      1      .      1      2
Total                   974  22558   2428    109      6  26075
Rel. percent           3.7%  86.5%   9.3%   0.4%   0.0% 100.0%
Abs. percent           2.8%  64.7%   7.0%   0.3%   0.0%  74.8%
Average number of R-letters: 0.76
Average number of O-letters: 1.06
(The absolute percentages are relative to the total number of words
in the text. These counts do not include those soft words, prefixes,
and suffixes --- about 120 of each --- that contain invalid elements
such as "shh", "oq", unattached "i" or "e", etc. Hence the
discrepancy between the totals for prefixes and suffixes.)
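The totals and averages printed under each table follow directly from the cell counts. A minimal sketch for the SOFT WORDS table, with the "-" and "." cells entered as zero (the function name is illustrative):

```python
# SOFT WORDS table: row index = number of R-letters (0..5),
# column index = number of O-letters (0..4); "-" and "." entered as 0.
SOFT = [
    [0, 240, 18, 0, 0],
    [475, 3000, 387, 5, 0],
    [62, 3113, 936, 63, 3],
    [7, 63, 283, 55, 7],
    [1, 4, 24, 6, 1],
    [0, 0, 0, 1, 0],
]

def summarize(table):
    """Return (total strings, mean R-letters, mean O-letters) for a table."""
    total = sum(sum(row) for row in table)
    mean_r = sum(r * sum(row) for r, row in enumerate(table)) / total
    mean_o = sum(o * n for row in table for o, n in enumerate(row)) / total
    return total, mean_r, mean_o

total, mean_r, mean_o = summarize(SOFT)
print(total, round(mean_r, 3), round(mean_o, 3))
```

The means come out at about 1.56 and 1.16; the 1.15 printed in the table looks truncated rather than rounded.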
Here are the counts (total and in major sections) of individual
crust patterns, with the R-letters mapped to "R" and the O-letters
mapped to "o" (so, for example, "daiin" becomes "RoR", and "doaro"
becomes "RooRo"):
SOFT WORDS
   tot  pha.2  hea.1  cos.2  zod.1  heb.1  str.2  bio.1 pattern
------ ------ ------ ------ ------ ------ ------ ------ -------
  3030    136    821    129     34    223    579    605 RoR
  2605     87    186    123     97    230    842    563 oR
   722     34    101     24     17     70    239     98 oRoR
   475     15    149     17     17     34     46     45 R
   395     18    141     21     10     41     32     80 Ro
   240      3     48     21     15     16     40     38 o
   226      9     24     12      9     26     43     67 oRo
   187      7     25     17      2     18     55     21 RoRoR
   155     10     38      3      .     11     52      7 ooR
   144      3     37     10      2     15     27     28 RoRo
    62      .      8      4      .      6     20     12 RR
    54      4      9      2      2      8     16      6 oRR
    51      2      6      4      3      4     16      7 oRoRo
    47      3      5      4      2      7      7     13 oRRo
    47      3     13      2      .     11      4      . oRRoR
    31      2      2      .      1      1     14      6 RRoR
    30      2     15      .      1      5      1      1 RoRR
    30      .      6      1      .      5      5      7 RoRRo
    29      .      5      1      .      3      8     10 RRo
    21      1      5      2      1      2      1      3 RoRoRo
    20      2      4      3      .      .      9      1 RooR
    19      2      .      .      1      1      7      3 oRoRoR
    18      .      6      1      .      2      6      1 oo
    15      .      4      1      .      2      3      . RoRRoR
    14      1      4      .      .      1      6      . oRoRR
     9      1      2      .      .      3      1      1 oRoRRo
     7      .      .      1      .      1      3      1 RRR
     7      .      .      .      .      .      2      1 RoRoRR
     7      1      1      1      .      1      1      . ooRoR
     6      .      2      1      .      .      1      . Roo
     6      1      .      .      .      .      1      . oRoRoRo
     4      .      .      .      .      .      2      1 RRoRo
     4      .      1      1      .      1      1      . RoRoRoR
     3      .      .      .      1      1      1      . oRoo
     2      .      .      .      .      .      2      . RRRo
     2      .      .      .      .      .      1      1 RRoRoR
     2      .      1      .      .      .      1      . RoRRR
     2      .      .      .      1      .      .      . RoRoRRo
     2      .      1      .      .      .      1      . RooRo
     2      .      .      .      .      1      .      1 oRRoRo
     2      .      .      .      .      1      1      . oRooR
     2      .      .      .      1      .      1      . ooRR
     2      1      .      .      .      .      1      . ooRRoR
     2      .      .      .      1      .      .      . ooRo
     1      .      .      .      .      .      1      . RRRR
     1      .      .      .      .      .      .      1 RRRoR
     1      .      1      .      .      .      .      . RRoRR
     1      .      .      .      1      .      .      . RRoo
     1      .      1      .      .      .      .      . RoRRoRoR
     1      .      .      .      1      .      .      . RoRoRoRo
     1      .      .      .      .      .      1      . RoRooR
     1      .      .      .      .      .      1      . oRRRo
     1      1      .      .      .      .      .      . oRoRoo
     1      1      .      .      .      .      .      . oRooRo
     1      .      1      .      .      .      .      . ooRRo
     1      .      1      .      .      .      .      . ooRRoRo
     1      .      .      .      .      .      1      . ooRoRR
     1      .      1      .      .      .      .      . oooRoR
------ ------ ------ ------ ------ ------ ------ ------ -------
  8754    350   1675    406    220    751   2103   1629 Total
PREFIXES
   tot  pha.2  hea.1  cos.2  zod.1  heb.1  str.2  bio.1 pattern
------ ------ ------ ------ ------ ------ ------ ------ -------
 12534    490   3081    420    207   1048   3365   2003 -
 10789    383   1575    443    254    871   3559   2103 o-
  1546     20    222     34      3     59    679    384 R-
   882     27     46     13     12     58    265    314 oR-
   153      4     50     10      1     14     32     26 Ro-
   128      .     10      5      2      5     26     59 RoR-
    34      3     19      1      .      2      3      3 oo-
    23      2      4      2      .      2      4      6 oRo-
    10      .      1      .      .      .      6      2 RR-
     9      .      .      3      .      1      2      1 oRoR-
     6      1      1      1      .      .      1      2 oRR-
     4      1      .      .      .      .      1      . RoRo-
     4      .      2      .      .      1      1      . Roo-
     3      1      1      .      .      .      1      . ooR-
     1      .      1      .      .      .      .      . RRR-
     1      .      .      .      .      .      1      . RoRoR-
     1      .      .      .      .      .      1      . oRoRo-
------ ------ ------ ------ ------ ------ ------ ------ -------
SUFFIXES
   tot  pha.2  hea.1  cos.2  zod.1  heb.1  str.2  bio.1 pattern
------ ------ ------ ------ ------ ------ ------ ------ -------
  9112    420   2306    271    146    557   2533   1319 -oR
  7838    340   1884    356    164    501   2223   1349 -o
  4745     10     26     47     29    583   1769   1859 -Ro
  1258     81    241     98     42    130    302     46 -oRo
   749      .     15     14     15     88    398    100 -RoR
   741     11     96     49     23     64    277    133 -R
   726     33    170     35     35     38    219     29 -oRoR
   299     12     78     20      2     45     58     26 -
   141      8     49      5      3     14     10     21 -oRRo
   117      5     30     13      5      6     40      1 -ooR
    83      9     32      7      .      5     16      . -oo
    82      7     32      2      4      2     14      4 -oRR
    64      1     22      2      7      5     14      1 -oRoRo
    34      1     10      3      2      4      4      2 -oRRoR
    29      .      4      2      .      2     15      3 -RR
    24      1      2      3      1      7      7      1 -RoRo
    22      .      .      2      1      2      7      8 -RRo
    21      .      4      3      .      2      9      1 -oRoRoR
    20      .      .      1      .      3     11      3 -RoRoR
    10      .      5      .      .      .      1      1 -oRoRR
    10      .      1      1      .      2      3      . -ooRo
     7      .      1      .      .      .      4      1 -RoRRo
     5      .      1      .      .      1      3      . -RRR
     4      .      .      1      .      .      2      . -RoRR
     4      .      .      1      .      2      1      . -oRRR
     3      .      .      .      .      .      1      2 -RooR
     3      1      .      .      .      .      1      . -oRRoRo
     3      .      1      1      .      .      .      . -oRoRRo
     3      .      1      .      .      .      .      . -oRooR
     2      .      .      .      .      .      1      1 -RRoR
     2      .      .      .      .      .      2      . -RoRoRo
     2      .      1      .      .      .      1      . -Roo
     2      .      1      .      .      .      1      . -ooRoR
     1      .      1      .      .      .      .      . -RRoRRo
     1      .      .      .      .      1      .      . -RRoRo
     1      .      .      .      .      .      .      1 -oRRRo
     1      .      .      .      .      .      1      . -oRoRoRo
     1      .      .      .      .      .      1      . -oRoRoRoR
     1      .      .      .      .      .      1      . -oRoRooR
     1      .      1      .      .      .      .      . -oRooRo
     1      .      .      .      .      .      1      . -ooRoRo
     1      .      .      .      .      .      .      1 -ooo
     1      .      .      .      .      .      1      . -oooRo
------ ------ ------ ------ ------ ------ ------ ------ -------
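The mapping from crust strings to the R/o patterns listed in these tables can be sketched as below. It assumes an R element is a dealer or final letter with any number of leading "i"s and an optional trailing "e", and that an O-letter is one of "a", "o", "y"; strings containing anything else are rejected. The function name is hypothetical.

```python
import re

# Group 1: an R element (dealer or final with its [ie] modifiers).
# Group 2: a single circle letter.
TOKEN = re.compile(r"(i*[dlrsxvnmgj]e?)|([aoy])")

def crust_pattern(text):
    """Map a crust string to its R/o pattern, or None if it does not parse."""
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() != pos:      # an unparseable letter was skipped
            return None
        out.append("R" if m.group(1) else "o")
        pos = m.end()
    return "".join(out) if pos == len(text) else None

print(crust_pattern("daiin"))  # RoR
print(crust_pattern("doaro"))  # RooRo
```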
We can see that consecutive R's and consecutive O's are rare, but
not rare enough to be dismissed as errors:
    soft words with RR = 409 (4.7% of soft words)
    prefixes with RR   =  17 (0.1% of non-empty prefixes)
    suffixes with RR   = 349 (1.4% of non-empty suffixes)
    soft words with OO = 227 (2.6% of soft words)
    prefixes with OO   =  41 (0.3% of non-empty prefixes)
    suffixes with OO   = 225 (0.9% of non-empty suffixes)
Words with consecutive RRRs and OOOs are extremely rare.
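These counts can be recovered by summing the pattern tables over every pattern that contains the doubled letter. A short sketch using the PREFIXES table, which is small enough to copy in full (the dictionary and function names are illustrative):

```python
# Prefix patterns and their total counts, from the PREFIXES table
# (the trailing "-" is dropped; "" is the empty prefix).
PREFIXES = {
    "": 12534, "o": 10789, "R": 1546, "oR": 882, "Ro": 153, "RoR": 128,
    "oo": 34, "oRo": 23, "RR": 10, "oRoR": 9, "oRR": 6, "RoRo": 4,
    "Roo": 4, "ooR": 3, "RRR": 1, "RoRoR": 1, "oRoRo": 1,
}

def count_with(patterns, pair):
    """Total count of strings whose pattern contains the given substring."""
    return sum(n for p, n in patterns.items() if pair in p)

non_empty = sum(n for p, n in PREFIXES.items() if p)
rr = count_with(PREFIXES, "RR")
oo = count_with(PREFIXES, "oo")
print(f"RR: {rr} ({100 * rr / non_empty:.1f}% of {non_empty} non-empty prefixes)")
print(f"OO: {oo} ({100 * oo / non_empty:.1f}% of {non_empty} non-empty prefixes)")
```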
These low counts show that the R-letters, like the O-letters, are
not randomly distributed --- they tend to alternate with the O's.
This alternation is not simply a consequence of
mutual repulsion between the O's. Compare for instance the following
entries from the soft word table:
   tot  pha.2  hea.1  cos.2  zod.1  heb.1  str.2  bio.1 pattern
------ ------ ------ ------ ------ ------ ------ ------ -------
  3030    136    821    129     34    223    579    605 RoR
    54      4      9      2      2      8     16      6 oRR
    29      .      5      1      .      3      8     10 RRo
   722     34    101     24     17     70    239     98 oRoR
   144      3     37     10      2     15     27     28 RoRo
    47      3      5      4      2      7      7     13 oRRo
If avoidance of OO were the only force acting here, then the
frequencies of "oRR" and "RRo" should be similar to those of "RoR",
since none of the three contains a doubled O. Ditto for "oRoR",
"RoRo", and "oRRo".
Note that this alternation of R-letters and O-letters confirms that
the two classes are qualitatively distinct.
Well, enough for now....
All the best,
--stolfi