Re: VMs: One simple question
> The word is rcheodalor. Now my reason for asking is that I believe
> this word is not a possible word in the vms. It follows the rules,
> as far as I can tell, but the two halves (rcheo) and (dalor) I
> believe to be illegal combinations. Please don't ask why I think
> this I have enough trouble working it out myself. :-)
The easy answer is that we do not know what "Voynichese" is,
so we cannot tell whether "rcheodalor" is valid Voynichese or not.
We cannot assume that the VMS used all possible words of Voynichese,
not even those which would be fairly common in other contexts. For
instance, H. G. Wells's "War of the Worlds" has about a hundred
occurrences of "brother", but not a single one of "brothers".
One could try to answer the question nonetheless by finding a simple
model (meaning one that is much shorter than simply the list of all
VMS words) that would fit all the words in the VMS, but still exclude a
large fraction of all glyph strings. To illustrate the idea,
imagine a Martian who got a list of Roman numerals, without
any context. After some analysis he may discover that no "word"
of that list has an "I" before a "C", or more than one "V", etc.
Then he may risk saying that "XICXX" and "LVVI" are illegal "words".
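To make the Martian's procedure concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the "list" is taken to be the standard numerals for 1 to 3999, and the model records just two things, the largest number of times each letter occurs in a word and which ordered pairs of letters (one somewhere before the other) were ever seen. The names to_roman, learn_model and looks_legal are hypothetical.

```python
from itertools import combinations

def to_roman(n):
    """Standard Roman numeral for 1 <= n <= 3999."""
    table = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in table:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def learn_model(words):
    """Record per-letter maximum counts and all observed ordered letter pairs."""
    max_count = {}
    seen_pairs = set()
    for w in words:
        for ch in set(w):
            max_count[ch] = max(max_count.get(ch, 0), w.count(ch))
        # combinations() keeps positional order: (w[i], w[j]) with i < j
        seen_pairs.update(combinations(w, 2))
    return max_count, seen_pairs

def looks_legal(word, model):
    """True if the word breaks none of the regularities seen in the sample."""
    max_count, seen_pairs = model
    if any(word.count(ch) > max_count.get(ch, 0) for ch in set(word)):
        return False
    return all(pair in seen_pairs for pair in combinations(word, 2))

# Assumption: the Martian's sample is every numeral from 1 to 3999.
model = learn_model(to_roman(n) for n in range(1, 4000))
for candidate in ["XIV", "XICXX", "LVVI"]:
    print(candidate, looks_legal(candidate, model))
```

Note that "illegal" here only means "breaks a regularity of the sample"; the model is far shorter than the list itself, which is the whole point.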
Unfortunately, this approach does not work so well for natural
languages, because most of their words are very rare and will
not show up at all in a VMS-size sample. So if you do not see a
given pattern, you never know whether it is really forbidden, or just
too rare to register.
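A back-of-the-envelope sketch of "too rare to register": if a word has relative frequency p and the tokens are treated as independent draws (a simplifying assumption, not a claim about real text), the chance that it never appears among N tokens is (1 - p)^N. The values of p and N below are hypothetical, chosen only to show the order of magnitude.

```python
def prob_absent(p, n_tokens):
    """Chance that a word of relative frequency p never occurs in
    n_tokens independent draws (independence is an assumption)."""
    return (1.0 - p) ** n_tokens

# Hypothetical numbers: a one-in-100,000 word and a 30,000-token sample.
print(prob_absent(1e-5, 30000))
```

With numbers of that order, a perfectly valid word is more likely to be missing from the sample than present in it.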
Voynichese has a Zipf-like word frequency distribution, so it too
suffers from that problem. There are fairly simple word models (such
as my word grammar,
http://www.ic.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/txt.n.html)
that fit 95% of the VMS words. By relaxing them a bit, or adding
exception lists, we could easily cover 100% of the words. (I don't
think it is worth trying, because we know that some 5-10% of the
tokens in our transcriptions are wrong anyway.) But still we cannot
say that a word that doesn't fit such a model is invalid --- it could
just be rare, or even common in general but rare in that context (like
"brothers" in WotW).
If a Martian used the same methods to build a model for English words,
based only on the WotW text, he might well get one that excludes
"brothers" but accepts "foots". Indeed, from that sample he will
probably conclude that an English word cannot contain more than four
"e"s, nor the string "shh". But of course these patterns are not
invalid, only very rare. Now consider the rule "no word has more than
one gallows letter": IIRC it holds for over 95% of the tokens in the
VMS. Are the exceptions transcription errors, or rare but valid words?
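Checking a rule like that against a transcription is simple; the hard part is interpreting the exceptions. A rough sketch, assuming the text is in EVA and that the gallows glyphs are written "t", "k", "p" and "f" (both are my assumptions here). The token list is a made-up toy, not data from any transcription.

```python
GALLOWS = set("tkpf")  # assumption: EVA letters for the gallows glyphs

def gallows_count(token):
    """Number of gallows letters in one token."""
    return sum(ch in GALLOWS for ch in token)

def fraction_obeying(tokens, max_gallows=1):
    """Share of tokens with at most max_gallows gallows letters."""
    ok = sum(gallows_count(t) <= max_gallows for t in tokens)
    return ok / len(tokens)

# Toy tokens for illustration only; "otkedy" is a hypothetical two-gallows string.
toy_tokens = ["daiin", "qokeedy", "chedy", "otkedy"]
print(fraction_obeying(toy_tokens))
print(gallows_count("rcheodalor"))
```

The function reports how often the rule holds, but it cannot say whether a violating token is a transcription error or a rare valid word.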
Back to rcheodalor: That word does fit my grammar, and by using
mostly common rules:
0.96296 Word -> NormalWord
0.74680 NormalWord -> CrustPrefix.MantleCore.CrustSuffix
0.81803 CrustPrefix -> CrP
0.10177 CrP -> OR
0.45312 OR -> R
0.20410 R -> "r"
0.33279 MantleCore -> WholeMantle
0.98109 WholeMantle -> MtS
0.35207 MtS -> OCH.OE
0.95997 OCH -> CH
0.70158 CH -> "ch"
0.98506 OE -> "e"
1.00000 CrustSuffix -> CrS.OptOFinal
0.00277 CrS -> OR.OR.OR
0.54054 OR -> O.R
0.62448 O -> "o"
0.40034 R -> "d"
0.54054 OR -> O.R
0.36006 O -> "a"
0.31763 R -> "l"
0.54054 OR -> O.R
0.62448 O -> "o"
0.20410 R -> "r"
0.36468 OptOFinal -> ""
According to my grammar, the only strange thing is
the presence of three "OR" groups in the crust suffix.
Namely the alternative CrS -> OR.OR.OR is taken by only
70 tokens, out of 25262 tokens which include a "CrS" part.
But that is a lot more than the frequency of "feeblenesses"
in English, or "interviewera" in French. And in any case
that is *my* model -- in other models, that word may not be
so strange at all.
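For anyone who wants to play with the numbers in the derivation above, here is a small sketch. It checks that 70/25262 matches the 0.00277 printed for CrS -> OR.OR.OR, and multiplies the listed rule frequencies together. Treating that product as a probability for the whole word assumes the rule choices are independent, which is my simplification for the sketch and not something the grammar page claims.

```python
import math

# Rule frequencies copied from the derivation of "rcheodalor" above.
derivation = [
    0.96296, 0.74680, 0.81803, 0.10177, 0.45312, 0.20410,  # Word .. R -> "r"
    0.33279, 0.98109, 0.35207, 0.95997, 0.70158, 0.98506,  # MantleCore .. OE -> "e"
    1.00000, 0.00277,                                      # CrustSuffix, CrS -> OR.OR.OR
    0.54054, 0.62448, 0.40034,                             # OR -> O.R: "o", "d"
    0.54054, 0.36006, 0.31763,                             # OR -> O.R: "a", "l"
    0.54054, 0.62448, 0.20410,                             # OR -> O.R: "o", "r"
    0.36468,                                               # OptOFinal -> ""
]

# Frequency of the rare alternative: 70 of the 25262 tokens with a CrS part.
crs_triple_or = 70 / 25262
print(round(crs_triple_or, 5))

# Product of all rule frequencies (independence of choices is assumed).
print(math.prod(derivation))

# The single least common rule used in the derivation.
print(min(derivation))
```

The smallest factor is the CrS -> OR.OR.OR rule; every other step uses an alternative taken by at least a tenth of the tokens that reach it.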
All the best,
--stolfi