Yet more bean-counting: [aoy]

To: voynich@xxxxxxxx
Subject: Yet more bean-counting: [aoy]
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxxx>
Date: Mon, 24 Jan 2000 08:35:55 -0200 (EDT)
Delivered-to: reeds@research.att.com
Reply-to: stolfi@xxxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx
Introduction
------------

  In previous notes about the structure of Voynichese words, I have
  been ignoring the `circle' letters O = { a o y }. This note looks at the
  distribution of the O-letters within the words.

The word paradigm
-----------------

  As you may recall, my Voynichese word paradigm (ignoring circle
  letters) has the form

    Q?1 D?1 X?2 M?1 X?2 R?2 

  where the notation A?n means from zero to n instances of A,
  and

    Q = { q }

    M = { k t p f } (the `gallows'), possibly preceded by "i" or "c" and/or
        followed by "h" and/or "e".

    X = { ch sh ee } (the `benches'), possibly followed by one "e" 

    R  = D + F

    D = { d l r s x v } (the `dealers'), possibly preceded by "i"s and/or
        followed by "e"

    F = { n m g j } (the `finals'), also possibly preceded by "i"s and/or
        followed by "e"

  I will use the term `element' to mean any of these letters with
  the attached [iceh] modifiers. 

  The paradigm implies that a word has a three-layer structure, with a
  `core' of gallows elements, a `mantle' of benches, and a `crust' of
  dealers and finals. Any layer may be empty, but if present it must be
  a contiguous substring of elements, adjacent to or surrounding the
  deeper layers. In particular the paradigm forbids words with more than
  one M-letter, or two X- or M-letters separated by an R letter.
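
  As a rough illustration of the paradigm, here is a minimal Python
  sketch that splits a word into elements and checks the sequence of
  letter classes against Q?1 D?1 X?2 M?1 X?2 R?2. The regular
  expressions, the names tokenize and fits_paradigm, and the rule that
  an "e" modifier is not taken when another "e" follows are assumptions
  of this sketch, not the procedure actually used for these notes.
  The word "kot" is a made-up counterexample with two gallows.

```python
import re

# Optional "e" modifier. The lookahead leaves a following "ee" bench intact,
# so that e.g. "keedy" is read as k + ee + d + y (an assumption of this sketch).
MOD_E = r"(?:e(?!e))?"

# One named group per letter class. Benches are tried before dealers so that
# "sh" is read as a bench, not as dealer "s" plus a stray "h".
ELEMENT = re.compile(
    r"(?P<Q>q)"
    rf"|(?P<X>(?:ch|sh|ee){MOD_E})"
    rf"|(?P<M>[ic]?[ktpf]h?{MOD_E})"
    rf"|(?P<D>i*[dlrsxv]{MOD_E})"
    rf"|(?P<F>i*[nmgj]{MOD_E})"
    r"|(?P<O>[aoy])"
)

# Q?1 D?1 X?2 M?1 X?2 R?2, with R = D + F, applied to the class string.
PARADIGM = re.compile(r"Q?D?X{0,2}M?X{0,2}[DF]{0,2}")


def tokenize(word):
    """Split a word into (class, text) elements; None if it cannot be parsed."""
    tokens, pos = [], 0
    while pos < len(word):
        match = ELEMENT.match(word, pos)
        if match is None:
            return None
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def fits_paradigm(word):
    """True if the word, with circle letters ignored, matches the paradigm."""
    tokens = tokenize(word)
    if tokens is None:
        return False
    skeleton = "".join(cls for cls, _ in tokens if cls != "O")
    return PARADIGM.fullmatch(skeleton) is not None


for word in ["chokedy", "shchotchy", "orckhocheody", "daiin", "kot"]:
    print(word, tokenize(word), fits_paradigm(word))
```

  Working on the class string rather than on the raw letters keeps the
  paradigm itself a one-line pattern, separate from the messier
  question of how letters group into elements.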

  (Beware that my notation and nomenclature have been changing through
  these notes. Sorry for the confusion, but these are *working*
  notes...)

Circles are not doubled
-----------------------

  Implicit in the paradigm is the rule that circle letters can only be
  inserted before or after an element, not within it (e.g. not between
  a "k" and its modifying "e"). Thus a word with N elements has N+1
  `slots' where the circle letters could be inserted.
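
  The slot idea can be sketched in Python as follows. The element
  pattern and the function name slot_occupancy are assumptions made
  for illustration (in particular the way "e" modifiers are attached);
  the function returns, for a word with N non-circle elements, the
  number of circle letters found in each of its N+1 slots.

```python
import re
from collections import Counter

MOD_E = r"(?:e(?!e))?"   # "e" modifier, not taken when another "e" follows

# Either a circle letter, or any other element with its attached modifiers.
ELEMENT = re.compile(
    r"(?P<O>[aoy])"
    rf"|(?P<other>q|(?:ch|sh|ee){MOD_E}|[ic]?[ktpf]h?{MOD_E}"
    rf"|i*[dlrsxvnmgj]{MOD_E})"
)


def slot_occupancy(word):
    """Number of circle letters in each of the N+1 slots of a word."""
    slots, pos = [0], 0
    while pos < len(word):
        match = ELEMENT.match(word, pos)
        if match is None:
            raise ValueError(f"cannot parse {word!r} at position {pos}")
        if match.lastgroup == "O":
            slots[-1] += 1        # one more circle in the current slot
        else:
            slots.append(0)       # a non-circle element opens a new slot
        pos = match.end()
    return slots


# Tally of slot occupancies over a tiny illustrative word list.
histogram = Counter()
for word in ["daiin", "doaro", "orckhocheody"]:
    histogram.update(slot_occupancy(word))
print(sorted(histogram.items()))
```

  Summing such a tally over the whole text would give a table of
  empty, single, and multiple occupancies per slot.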

  There are about 51972 O-letters in the sample text, and about 109672
  possible O-slots between elements. These slots are occupied as follows
  
    0 circle letters:  58207 (53.1%)
    1 circle letter:   50819 (46.3%)
    2 circle letters:    501  (0.5%)
    3 circle letters:      3  (0.0%)
    
  There are no instances of 4 or more circles in a row, except for 
  the `primeval scream' atop one of the cosmo diagrams. (There are
  also 142 anomalous inter-element strings, such as "oe". We will
  ignore them for now.)
  
  Note that there is a definite dislike for two or more O-letters in a
  row. If there were no restriction on the interleaving of O's and
  other elements, then 62% of the slots would be empty, 29% would
  contain one circle, 7% (i.e. over 7000) would contain two circles,
  1% (over 1000) would have three, 0.1% would have 4, and so on.
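
  The percentages quoted for the unrestricted case are close to what a
  Poisson model gives when the 51972 O-letters are scattered
  independently over the 109672 slots. That this is the model behind
  the figures is an assumption; the sketch below only shows that it
  yields numbers of the same size.

```python
from math import exp, factorial

N_CIRCLES = 51972    # O-letters in the sample text
N_SLOTS = 109672     # possible O-slots between elements


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson law with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


mean_per_slot = N_CIRCLES / N_SLOTS
expected = [poisson_pmf(k, mean_per_slot) for k in range(5)]

for k, p in enumerate(expected):
    print(f"{k} circles: {100 * p:5.2f}% of slots, about {p * N_SLOTS:.0f} slots")
```

  Compare with the observed 501 double and 3 triple circles.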
  
Distribution of circles in local context
----------------------------------------

  The following table shows the occurrences of O-strings according to
  the two adjacent elements. The letters { a y } have been mapped to
  "o" to make the table shorter. Word boundaries are denoted "#" and
  empty O-strings by "_".
  
                Inter-element string
                -----------------------------
      Context       _     o    oo   ooo other
      -------   ----- ----- ----- ----- -----
      #*#       -N/A-   240    18     .     3
      M*#         189  2896    32     1     3
      X*#         110  4942    51     .     3
      R*#       19274  7311    13     .     2

      #*R        6402  4748   174     1    33
      R*R         838  6592    34     .     6
      X*R        5086  4749    69     .     2
      M*R        1402  7058    64     1     6

      #*X        8899   577     9     .     1
      R*X        1755    53     .     .     .
      X*X        1294    37     .     .     .
      M*X        6186    95     .     .     1

      #*M        3635 10212    25     .    40
      R*M        1237   159     5     .     7
      X*M        1633   894     5     .    29
      M*M          11   114     .     .     6

      other       166   142     2     .     .

      TOTAL     58207 50819   501     3   142 

  (The "other" counts are letter groups such as "oe", "shh", "ich",
  detached [ice], etc. which cannot be parsed into the standard set of
  elements.)
  
  Note again that, overall, half of the O-slots are empty, and half
  are occupied by "o". If the placement of the "o"s were independent
  of the context, we should expect to see the same 1:1 ratio between
  the first two numbers in each row. We see instead that the contexts
  M*X, X*X, R*X, #*X strongly repel O-strings (ratios 65:1, 35:1,
  33:1, 15:1, respectively), while X*#, M*#, and R*R strongly attract
  them (ratios 1:45, 1:15, and 1:8, respectively).
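
  The ratios can be recomputed from the first two columns of the
  table with a few lines of Python. The dictionary below simply
  copies the "_" and "o" counts; the helper name ratio is made up
  for this sketch.

```python
# (empty slots, slots holding a single circle), copied from the table above.
CONTEXT_COUNTS = {
    "M*#": (189, 2896), "X*#": (110, 4942), "R*#": (19274, 7311),
    "#*R": (6402, 4748), "R*R": (838, 6592), "X*R": (5086, 4749),
    "M*R": (1402, 7058),
    "#*X": (8899, 577), "R*X": (1755, 53), "X*X": (1294, 37),
    "M*X": (6186, 95),
    "#*M": (3635, 10212), "R*M": (1237, 159), "X*M": (1633, 894),
    "M*M": (11, 114),
}


def ratio(empty, filled):
    """Format empty:filled as 'n:1' or '1:n', rounded to a whole number."""
    if empty >= filled:
        return f"{empty / filled:.0f}:1"
    return f"1:{filled / empty:.0f}"


# List contexts from most circle-repelling to most circle-attracting.
ranked = sorted(CONTEXT_COUNTS.items(),
                key=lambda item: item[1][0] / item[1][1], reverse=True)
for context, (empty, filled) in ranked:
    print(f"{context}  {empty:6d} {filled:6d}  {ratio(empty, filled)}")
```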
  
  These numbers suggest that an O letter is either word-final, or a
  modifier for the following R or M letter (but not X letter). Indeed,
  of the 50819 instances of isolated "o", 49675 instances (97.7%) are
  in one of these contexts. However, this cannot be taken as an axiom,
  because, of the 18905 O-slots that are followed by an X element,
  771 (4%) are filled --- a percentage which is too high to ignore. So
  the truth must be more complicated than that.
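
  For the record, the two figures in this paragraph follow from the
  context table by simple addition. The sketch below uses only counts
  copied from the table; note that the 240 single-circle words of the
  #*# row are not included in the 49675.

```python
# Single-circle ("o") counts copied from the context table.
word_final = 2896 + 4942 + 7311           # M*#, X*#, R*#
before_r = 4748 + 6592 + 4749 + 7058      # #*R, R*R, X*R, M*R
before_m = 10212 + 159 + 894 + 114        # #*M, R*M, X*M, M*M
total_single = 50819

explained = word_final + before_r + before_m
print(explained, f"{100 * explained / total_single:.2f}%")

# Slots followed by an X element: the "_", "o" and "oo" columns of the *X rows.
empty_before_x = 8899 + 1755 + 1294 + 6186
filled_before_x = (577 + 53 + 37 + 95) + 9
slots_before_x = empty_before_x + filled_before_x
print(slots_before_x, filled_before_x,
      f"{100 * filled_before_x / slots_before_x:.1f}%")
```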
  
Location of circles in the word paradigm 
----------------------------------------

  Let's say that a word is `hard' if it has a non-empty core and/or
  mantle, and `soft' otherwise.
  
  In a hard word we can isolate a maximal `prefix' and a maximal
  `suffix' consisting of non-core, non-mantle letters --- namely,
  dealers, finals, circles, and any [ie] modifiers. Thus, for example,
  the hard word "orckhocheody" can be split into prefix "or", suffix
  "ody", and core-mantle "ckhoche".

  Note that a prefix, suffix, or soft word with N non-circle elements
  has N+1 slots where circles could be inserted, while a core-mantle
  with N non-circle elements has N-1 such slots. The following table
  shows the counts of empty and occupied circle slots in the three
  parts of hard words.

    soft words:    22435 O-slots,  9952 occupied (44%)
    prefixes:      29078 O-slots, 12082 occupied (42%)
    suffixes:      46322 O-slots, 27572 occupied (60%)
    core-mantles:  11133 O-slots,  1534 occupied (14%)
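
  The split of a hard word into prefix, core-mantle, and suffix that
  underlies these counts can be sketched in Python as below. The
  pattern for bench and gallows elements and the function name
  split_hard_word are assumptions of this sketch; ambiguous letter
  groups may well be resolved differently in the actual counts.

```python
import re

MOD_E = r"(?:e(?!e))?"   # "e" modifier, not taken when another "e" follows

# Elements that belong to the core (gallows) or the mantle (benches).
CORE_MANTLE_ELEMENT = re.compile(
    rf"(?:ch|sh|ee){MOD_E}"        # benches, with optional "e"
    rf"|[ic]?[ktpf]h?{MOD_E}"      # gallows, with optional modifiers
)


def split_hard_word(word):
    """Return (prefix, core-mantle, suffix), or None for a soft word."""
    hits = list(CORE_MANTLE_ELEMENT.finditer(word))
    if not hits:
        return None
    start, end = hits[0].start(), hits[-1].end()
    return word[:start], word[start:end], word[end:]


print(split_hard_word("orckhocheody"))
print(split_hard_word("daiin"))
```

  Everything from the first to the last bench or gallows element is
  taken as the core-mantle, so any circles between those elements are
  counted as interior O-slots.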

  Thus we see that the O-letters strongly avoid the interior of
  core-mantles. In fact, if we look closely, we find that most of the
  filled O-slots in core-mantles are either combinations "Xo" that
  precede the core, as in "chokedy" or "shchotchy", or occur in
  `invalid' core-mantles (with more than one M, and/or with R intrusions).
  Here are the numbers:
  
    valid core-mantles with O-slots: 9023
      without O-insertions:    8076 (89.5%)
      with "Xo" before core:    778  (8.6%)
      with "y" insertions:       80  (0.9%)
      with other O-insertions:   89  (1.0%)

    invalid core-mantles with O-slots: 655
      without O-insertions:     109 (16.6%)
      with "Xo" before core:     94 (14.3%) 
      with "y" intrusions:       52  (7.9%)
      with other O-insertions:  400 (61.1%)
      
  Note that "y" is almost always word-initial or word-final, so an
  intra-word "y" is probably the result of an omitted word space.
  The 89 valid core-mantles with other O-insertions may well be
  due to the same cause.
  
  Moreover, the enhanced frequency of "y" inside invalid core-mantles
  suggests that these too are the result of joined words. So the
  400 invalid core-mantles with other O-insertions are not significant. 
    
  In short, the circles are found mostly in the `crust' of words,
  except for some 800 instances of "cho" and "sho" sequences in the
  first half of the mantle. 

Relationship between O- and R-letters
-------------------------------------

  Let's look more closely at the interleaving of O and R letters in
  the crust of words. That means about 8800 soft (crust-only) words,
  as well as the prefixes and suffixes of about 26,000 hard words.
  
  First, let's classify those strings according to the number of
  R's and the number of O's:
  
    
    SOFT WORDS

                                                    O-letters in word
                            -----------------------------------------
      R-letters in word         0      1      2      3      4   Total
      --------------------  -----  -----  -----  -----  -----   -----
      0 R-letters               -    240     18      .      .     258
      1 R-letter              475   3000    387      5      .    3867
      2 R-letters              62   3113    936     63      3    4177
      3 R-letters               7     63    283     55      7     415
      4 R-letters               1      4     24      6      1      36
      5 R-letters               .      .      .      1      .       1

      Total                   545   6420   1648    130     11    8754
      Rel. percent            6.2%  73.3%  18.8%   1.5%   0.1%  100.0%
      Abs. percent            1.6%  18.3%   4.7%   0.4%   0.0%   24.9%
      
      Average number of R-letters: 1.56
      Average number of O-letters: 1.15


    PREFIXES

                                           O-letters in prefix
                            ----------------------------------
      R-letters in prefix       0      1      2      3   Total
      --------------------  -----  -----  -----  -----   -----
      0 R-letters           12534  10789     34      .   23357
      1 R-letter             1546   1035     30      .    2611
      2 R-letters              10    134     13      1     158
      3 R-letters               1      .      1      .       2

      Total                 14091  11958     78      1   26128
      Rel. percent           53.9%  45.8%   0.3%   0.0%  100.0%
      Abs. percent           40.4%  34.3%   0.2%   0.0%   74.9%


      Average number of R-letters: 0.11
      Average number of O-letters: 0.46

    SUFFIXES

                                                  O-letters in suffix 
                            -----------------------------------------
      R-letters in suffix       0      1      2      3      4   Total
      --------------------  -----  -----  -----  -----  -----   -----
      0 R-letters             299   7838     83      1      .    8221   
      1 R-letter              641  13857   1377     10      1   15886   
      2 R-letters              29    853    894     69      2    1847             
      3 R-letters               5     10     73     29      2     119   
      4 R-letters               .      .      1      .      1       2

      Total                   974  22558   2428    109      6   26075             
      Rel.percent             3.7%  86.5%   9.3%   0.4%   0.0%  100.0%
      Abs. percent            2.8%  64.7%   7.0%   0.3%   0.0%   74.8%
 
      Average number of R-letters: 0.76
      Average number of O-letters: 1.06

  (The absolute percentages are relative to the total number of words
  in the text. These counts do not include the soft words, prefixes,
  and suffixes, about 120 of each, that contain invalid elements
  such as "shh", "oq", unattached "i" or "e", etc. Hence the
  discrepancy between the totals for prefixes and suffixes.)

  Here are the counts (total and in major sections) of individual
  crust patterns, with the R-letters mapped to "R" and the O-letters
  mapped to "o" (so, for example, "daiin" becomes "RoR", and "doaro"
  becomes "RooRo"):

    SOFT WORDS

         tot   pha.2   hea.1   cos.2   zod.1   heb.1   str.2   bio.1  pattern
      ------  ------  ------  ------  ------  ------  ------  ------  -------
        3030     136     821     129      34     223     579     605  RoR
        2605      87     186     123      97     230     842     563  oR
         722      34     101      24      17      70     239      98  oRoR
         475      15     149      17      17      34      46      45  R
         395      18     141      21      10      41      32      80  Ro
         240       3      48      21      15      16      40      38  o
         226       9      24      12       9      26      43      67  oRo
         187       7      25      17       2      18      55      21  RoRoR
         155      10      38       3       .      11      52       7  ooR
         144       3      37      10       2      15      27      28  RoRo
          62       .       8       4       .       6      20      12  RR
          54       4       9       2       2       8      16       6  oRR
          51       2       6       4       3       4      16       7  oRoRo
          47       3       5       4       2       7       7      13  oRRo
          47       3      13       2       .      11       4       .  oRRoR
          31       2       2       .       1       1      14       6  RRoR
          30       2      15       .       1       5       1       1  RoRR
          30       .       6       1       .       5       5       7  RoRRo
          29       .       5       1       .       3       8      10  RRo
          21       1       5       2       1       2       1       3  RoRoRo
          20       2       4       3       .       .       9       1  RooR
          19       2       .       .       1       1       7       3  oRoRoR
          18       .       6       1       .       2       6       1  oo
          15       .       4       1       .       2       3       .  RoRRoR
          14       1       4       .       .       1       6       .  oRoRR
           9       1       2       .       .       3       1       1  oRoRRo
           7       .       .       1       .       1       3       1  RRR
           7       .       .       .       .       .       2       1  RoRoRR
           7       1       1       1       .       1       1       .  ooRoR
           6       .       2       1       .       .       1       .  Roo
           6       1       .       .       .       .       1       .  oRoRoRo
           4       .       .       .       .       .       2       1  RRoRo
           4       .       1       1       .       1       1       .  RoRoRoR
           3       .       .       .       1       1       1       .  oRoo
           2       .       .       .       .       .       2       .  RRRo
           2       .       .       .       .       .       1       1  RRoRoR
           2       .       1       .       .       .       1       .  RoRRR
           2       .       .       .       1       .       .       .  RoRoRRo
           2       .       1       .       .       .       1       .  RooRo
           2       .       .       .       .       1       .       1  oRRoRo
           2       .       .       .       .       1       1       .  oRooR
           2       .       .       .       1       .       1       .  ooRR
           2       1       .       .       .       .       1       .  ooRRoR
           2       .       .       .       1       .       .       .  ooRo
           1       .       .       .       .       .       1       .  RRRR
           1       .       .       .       .       .       .       1  RRRoR
           1       .       1       .       .       .       .       .  RRoRR
           1       .       .       .       1       .       .       .  RRoo
           1       .       1       .       .       .       .       .  RoRRoRoR
           1       .       .       .       1       .       .       .  RoRoRoRo
           1       .       .       .       .       .       1       .  RoRooR
           1       .       .       .       .       .       1       .  oRRRo
           1       1       .       .       .       .       .       .  oRoRoo
           1       1       .       .       .       .       .       .  oRooRo
           1       .       1       .       .       .       .       .  ooRRo
           1       .       1       .       .       .       .       .  ooRRoRo
           1       .       .       .       .       .       1       .  ooRoRR
           1       .       1       .       .       .       .       .  oooRoR
      ------  ------  ------  ------  ------  ------  ------  ------  -------
        8754     350    1675     406     220     751    2103    1629  Total

    PREFIXES

         tot   pha.2   hea.1   cos.2   zod.1   heb.1   str.2   bio.1  pattern
      ------  ------  ------  ------  ------  ------  ------  ------  -------
       12534     490    3081     420     207    1048    3365    2003  -
       10789     383    1575     443     254     871    3559    2103  o-
        1546      20     222      34       3      59     679     384  R-
         882      27      46      13      12      58     265     314  oR-
         153       4      50      10       1      14      32      26  Ro-
         128       .      10       5       2       5      26      59  RoR-
          34       3      19       1       .       2       3       3  oo-
          23       2       4       2       .       2       4       6  oRo-
          10       .       1       .       .       .       6       2  RR-
           9       .       .       3       .       1       2       1  oRoR-
           6       1       1       1       .       .       1       2  oRR-
           4       1       .       .       .       .       1       .  RoRo-
           4       .       2       .       .       1       1       .  Roo-
           3       1       1       .       .       .       1       .  ooR-
           1       .       1       .       .       .       .       .  RRR-
           1       .       .       .       .       .       1       .  RoRoR-
           1       .       .       .       .       .       1       .  oRoRo-
      ------  ------  ------  ------  ------  ------  ------  ------  -------

    SUFFIXES

         tot   pha.2   hea.1   cos.2   zod.1   heb.1   str.2   bio.1  pattern
      ------  ------  ------  ------  ------  ------  ------  ------  -------
        9112     420    2306     271     146     557    2533    1319  -oR
        7838     340    1884     356     164     501    2223    1349  -o
        4745      10      26      47      29     583    1769    1859  -Ro
        1258      81     241      98      42     130     302      46  -oRo
         749       .      15      14      15      88     398     100  -RoR
         741      11      96      49      23      64     277     133  -R
         726      33     170      35      35      38     219      29  -oRoR
         299      12      78      20       2      45      58      26  -
         141       8      49       5       3      14      10      21  -oRRo
         117       5      30      13       5       6      40       1  -ooR
          83       9      32       7       .       5      16       .  -oo
          82       7      32       2       4       2      14       4  -oRR
          64       1      22       2       7       5      14       1  -oRoRo
          34       1      10       3       2       4       4       2  -oRRoR
          29       .       4       2       .       2      15       3  -RR
          24       1       2       3       1       7       7       1  -RoRo
          22       .       .       2       1       2       7       8  -RRo
          21       .       4       3       .       2       9       1  -oRoRoR
          20       .       .       1       .       3      11       3  -RoRoR
          10       .       5       .       .       .       1       1  -oRoRR
          10       .       1       1       .       2       3       .  -ooRo
           7       .       1       .       .       .       4       1  -RoRRo
           5       .       1       .       .       1       3       .  -RRR
           4       .       .       1       .       .       2       .  -RoRR
           4       .       .       1       .       2       1       .  -oRRR
           3       .       .       .       .       .       1       2  -RooR
           3       1       .       .       .       .       1       .  -oRRoRo
           3       .       1       1       .       .       .       .  -oRoRRo
           3       .       1       .       .       .       .       .  -oRooR
           2       .       .       .       .       .       1       1  -RRoR
           2       .       .       .       .       .       2       .  -RoRoRo
           2       .       1       .       .       .       1       .  -Roo
           2       .       1       .       .       .       1       .  -ooRoR
           1       .       1       .       .       .       .       .  -RRoRRo
           1       .       .       .       .       1       .       .  -RRoRo
           1       .       .       .       .       .       .       1  -oRRRo
           1       .       .       .       .       .       1       .  -oRoRoRo
           1       .       .       .       .       .       1       .  -oRoRoRoR
           1       .       .       .       .       .       1       .  -oRoRooR
           1       .       1       .       .       .       .       .  -oRooRo
           1       .       .       .       .       .       1       .  -ooRoRo
           1       .       .       .       .       .       .       1  -ooo
           1       .       .       .       .       .       1       .  -oooRo
      ------  ------  ------  ------  ------  ------  ------  ------  -------
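
  The mapping from crust strings to the R/o patterns listed in these
  tables can be sketched in Python as follows. The token pattern
  (dealers and finals with any leading "i"s and an optional trailing
  "e") and the name crust_pattern are assumptions of this sketch.

```python
import re
from collections import Counter

# An R-letter with its modifiers, or a single circle letter.
CRUST_TOKEN = re.compile(r"(?P<R>i*[dlrsxvnmgj]e?)|(?P<o>[aoy])")


def crust_pattern(text):
    """Map a soft word, prefix, or suffix to its R/o pattern; None if unparsable."""
    pattern, pos = [], 0
    while pos < len(text):
        match = CRUST_TOKEN.match(text, pos)
        if match is None:
            return None           # contains a non-crust or invalid element
        pattern.append(match.lastgroup)
        pos = match.end()
    return "".join(pattern)


# Tally patterns over a tiny illustrative list of soft words.
tally = Counter(crust_pattern(w) for w in ["daiin", "doaro", "aiin", "dar"])
print(tally.most_common())
```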

  We can see that consecutive R's and consecutive O's are rare, but
  not rare enough to be classed as errors:
  
    soft words with RR =  409 (4.7% of soft words)
    prefixes with RR =     17 (0.1% of non-empty prefixes)
    suffixes with RR =    349 (1.4% of non-empty suffixes)
    
    soft words with OO =  227 (2.6% of soft words)
    prefixes with OO =     41 (0.3% of non-empty prefixes)
    suffixes with OO =    225 (0.9% of non-empty suffixes)
  
  Words with three consecutive R's or three consecutive O's are
  extremely rare.

  These low counts show that the R-letters, like the O-letters, are
  not randomly distributed --- they tend to alternate with the O's.
  This alternation is not simply a consequence of
  mutual repulsion between the O's. Compare for instance the following
  entries from the soft word table:
  
         tot   pha.2   hea.1   cos.2   zod.1   heb.1   str.2   bio.1  pattern
      ------  ------  ------  ------  ------  ------  ------  ------  -------
        3030     136     821     129      34     223     579     605  RoR
          54       4       9       2       2       8      16       6  oRR
          29       .       5       1       .       3       8      10  RRo

         722      34     101      24      17      70     239      98  oRoR
         144       3      37      10       2      15      27      28  RoRo
          47       3       5       4       2       7       7      13  oRRo
  
  If avoidance of OO were the only force acting here, then the
  frequencies of "oRR" and "RRo" should be similar to those of "RoR".
  Ditto for "oRoR", "RoRo", and "oRRo".
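
  To make the comparison concrete, the Python sketch below lists all
  arrangements of a given number of R's and O's that contain no "oo",
  and sets the observed soft-word counts against the equal shares one
  would expect if OO-avoidance were the only constraint. The equal
  shares are a deliberately crude baseline, and the function name is
  made up for this sketch.

```python
from itertools import permutations

# Soft-word counts copied from the table entries quoted above.
OBSERVED = {"RoR": 3030, "oRR": 54, "RRo": 29,
            "oRoR": 722, "RoRo": 144, "oRRo": 47}


def arrangements_without_oo(n_r, n_o):
    """All distinct orderings of n_r R's and n_o o's with no two o's adjacent."""
    letters = "R" * n_r + "o" * n_o
    distinct = {"".join(p) for p in permutations(letters)}
    return sorted(s for s in distinct if "oo" not in s)


for n_r, n_o in [(2, 1), (2, 2)]:
    patterns = arrangements_without_oo(n_r, n_o)
    total = sum(OBSERVED[p] for p in patterns)
    for p in patterns:
        share = 100 * OBSERVED[p] / total
        print(f"{p:5s} {OBSERVED[p]:5d}  {share:5.1f}%"
              f"  (equal share: {100 / len(patterns):.1f}%)")
```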

  Note that this alternation of R-letters and O-letters confirms that 
  the two classes are qualitatively distinct.
  
Well, enough for now....

All the best,

--stolfi