Re: Add. stats for char position
Claus Anders writes:
> As addition, I wanted to know the prob. for each char at pos. x:
> Pos char -> falling proba.
> 1 o-c-q-s-d-a
> 2 h-o-a-k-t-c
> 3 e-k-h-o-a-c
> 4 e-i-o-a-d-y
> 5 y-i-e-d-h-n
> 6 y-i-d-n-e
> 7 y-n-i-d
> 8 y-n-i-d
> 9 y-n-i
> 10 n-y-i
> 11 n-y
From what I recall, Antoine Casanova's doctoral thesis was about
this kind of analysis.
I believe that the various Voynichese word paradigms (including my word
grammar) are another way of analyzing and describing the same
phenomenon -- namely, that as one scans a word from left to right one
sees some definite, non-repeating "evolution" of the code / spelling /
number / monkey / whatever is going on in there.
The word paradigms and grammars are in a sense more precise than the
position-dependent probabilistic model above. Consider for example the
lexicon obtained by concatenating any of {h,sh,sch} with any of
{a,ai,aio}, with each combination equally likely. The probabilistic
model will say that, on words of length 4, each of positions 1-3 can
be "h" with probability 1/3, and each of positions 2-4 can be "a" also
with probability 1/3. One might conclude that the word "haha" has
probability 1/81, but in fact it is invalid (probability 0).
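
The "haha" arithmetic can be checked with a small sketch. It assumes the
left parts are h, sh, sch (as in the grammar further down in this
message) and that all nine words are equally likely; the function names
position_model and model_probability are made up for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Toy lexicon: every left part joined to every right part, all equally likely.
lexicon = [l + r for l, r in product(["h", "sh", "sch"], ["a", "ai", "aio"])]

def position_model(words, length):
    """Per-position character probabilities over the words of a given length."""
    same_len = [w for w in words if len(w) == length]
    counts = [Counter(w[i] for w in same_len) for i in range(length)]
    return [{c: Fraction(n, len(same_len)) for c, n in pos.items()}
            for pos in counts]

def model_probability(word, model):
    """Probability of a word if each position is chosen independently."""
    p = Fraction(1)
    for ch, dist in zip(word, model):
        p *= dist.get(ch, Fraction(0))
    return p

model4 = position_model(lexicon, 4)          # built from haio, shai, scha
haha_prob = model_probability("haha", model4)
print("position model gives haha:", haha_prob)
print("haha is in the lexicon:", "haha" in lexicon)
```

The position model assigns "haha" a nonzero probability even though no
combination of a left part and a right part can produce it.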
On the other hand, even my probabilistic word grammar does not give
the whole picture, because it assumes implicitly that the choices made
at different slots are independent, which is almost certainly not the
case. In the example above, suppose we exclude the single word "sha"
from the lexicon, with the other 8 combinations still equally likely.
A natural choice for the probabilistic grammar would still be
    Word  -> Left.Right
    Left  -> h(3/8) | sh(2/8)  | sch(3/8)
    Right -> a(2/8) | ai(3/8)  | aio(3/8)
However, while the probabilities for each slot are correct,
the word probabilities that result from taking *independent*
choices in each slot are not correct:
    word      predicted prob    actual prob
    -------   --------------    -----------
    ha             6/64             8/64
    sha            4/64             0/64
    scha           6/64             8/64
    hai            9/64             8/64
    shai           6/64             8/64
    schai          9/64             8/64
    haio           9/64             8/64
    shaio          6/64             8/64
    schaio         9/64             8/64
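
For anyone who wants to reproduce these numbers, here is a rough sketch.
It rebuilds the slot probabilities from the 8-word lexicon and multiplies
them as if the two slots were independent. The helper name marginal is
hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

lefts = ["h", "sh", "sch"]
rights = ["a", "ai", "aio"]

# The 8-word lexicon: all combinations except "sha", each equally likely.
lexicon = [(l, r) for l, r in product(lefts, rights) if l + r != "sha"]
word_p = Fraction(1, len(lexicon))

def marginal(index):
    """Probability of each choice in one slot (0 = Left, 1 = Right)."""
    probs = {}
    for pair in lexicon:
        probs[pair[index]] = probs.get(pair[index], Fraction(0)) + word_p
    return probs

left_p, right_p = marginal(0), marginal(1)

# Predicted: product of slot probabilities (the independence assumption).
predicted = {l + r: left_p[l] * right_p[r] for l, r in product(lefts, rights)}
# Actual: 1/8 for words in the lexicon, 0 for the excluded word.
actual = {l + r: (word_p if (l, r) in lexicon else Fraction(0))
          for l, r in product(lefts, rights)}

for word in predicted:
    print(f"{word:8s} {int(predicted[word] * 64):2d}/64   "
          f"{int(actual[word] * 64):2d}/64")
```

The slot probabilities come out as in the grammar above, yet the products
disagree with the true word probabilities for every one of the nine words.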
This observed dependency on character position within the word could
mean several things. If the text is plain language, it seems to imply
either that individual words are syllables, as in many East Asian
languages, or that different alphabets/spelling rules are used for the
beginning, middle, and end of words, as in Arabic. (Other data,
however, seem to exclude the latter and reinforce the former.)
If the text is a letter-based cipher, that observation suggests an
automaton-like code which is reset (at least in part) at each word, or
that spaces are generated by the encoding. (However, the label data
seems to exclude this possibility, and the Zipf plots suggest that in
fact the code is completely reset at each new word.)
Finally, if the text is a codebook-based cipher, that observation
suggests a Roman-style number system, where different "digit
positions" use different sets of "digits". In particular,
words like "lkechy" (with an anomalous "l" before the gallows) may
be analogous to the "subtractive" Roman numerals, where
an I (units) digit may appear before an X (tens).
All the best,
--stolfi