Re: VMs: One simple question

To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: One simple question
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Sat, 18 Sep 2004 09:57:36 -0300
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx

  > The word is rcheodalor. Now my reason for asking is that I believe
  > this word is not a possible word in the vms. It follows the rules,
  > as far as I can tell, but the two halves (rcheo) and (dalor) I
  > believe to be illegal combinations. Please don't ask why I think
  > this I have enough trouble working it out myself. :-)

The easy answer is that we do not know what "Voynichese" is,
so we cannot tell whether "rcheodalor" is valid Voynichese or not.

We cannot assume that the VMS used all possible words of Voynichese,
not even those which would be fairly common in other contexts. For
instance, H. G. Wells's "The War of the Worlds" has about a hundred
occurrences of "brother", but not a single one of "brothers".

One could try to answer the question nonetheless by finding a simple
model (meaning one that is much shorter than simply the list of all
VMS words) that would fit all the words in the VMS, but still exclude a
large fraction of all glyph strings. To illustrate the idea,
imagine a Martian who got a list of Roman numerals, without
any context. After some analysis he may discover that no "word"
of that list has an "I" before a "C", or more than one "V", etc.
Then he may risk saying that "XICXX" and "LVVI" are illegal "words".
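
The Martian's procedure can be sketched in a few lines. This is only an
illustration under my own assumptions: the sample is the standard
numerals for 1 to 3999, and the "model" (the names learn_model and
is_accepted are made up here) records just two kinds of facts, namely
which glyph was ever seen somewhere before which other glyph, and the
largest number of times each glyph occurs in one word.

```python
def to_roman(n):
    """Standard Roman numeral for 1 <= n <= 3999."""
    table = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, glyphs in table:
        while n >= value:
            out.append(glyphs)
            n -= value
    return "".join(out)


def learn_model(words):
    """Learn, from the sample alone, which ordered glyph pairs occur
    (a anywhere before b) and the maximum count of each glyph."""
    seen_before = set()
    max_count = {}
    for w in words:
        for i, a in enumerate(w):
            for b in w[i + 1:]:
                seen_before.add((a, b))
        for g in set(w):
            max_count[g] = max(max_count.get(g, 0), w.count(g))
    return seen_before, max_count


def is_accepted(word, model):
    """True if the word breaks none of the learned constraints."""
    seen_before, max_count = model
    for g in set(word):
        if word.count(g) > max_count.get(g, 0):
            return False
    return all((a, b) in seen_before
               for i, a in enumerate(word) for b in word[i + 1:])


model = learn_model([to_roman(n) for n in range(1, 4000)])
for candidate in ["MCMXCIV", "XICXX", "LVVI", "IXI"]:
    print(candidate, is_accepted(candidate, model))
```

Such a model rejects "XICXX" (an "I" before a "C") and "LVVI" (two
"V"s), but it still accepts strings like "IXI" that are not proper
numerals. It is much shorter than the list of numerals, and it is a
guess about the rules, not the rules themselves.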

Unfortunately, this approach does not work so well for natural
languages, because most of their words are very rare and will
not show up at all in a VMS-size sample.  So if you do not see a
given pattern, you never know whether it is really forbidden, or just
too rare to register.
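
To get a feeling for the size of the effect, here is a rough
calculation, assuming an idealized Zipf law (probability proportional
to 1/rank) and independent draws. The vocabulary size and sample size
below are round numbers picked for illustration only, not measurements
of any real language or of the VMS.

```python
def expected_unseen_fraction(vocab_size, sample_size, exponent=1.0):
    """Expected fraction of the vocabulary that never appears in a
    sample of `sample_size` tokens drawn independently from a Zipf
    distribution with the given exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    # A word with probability p is missing from the sample
    # with probability (1 - p) ** sample_size.
    unseen = sum((1.0 - w / total) ** sample_size for w in weights)
    return unseen / vocab_size


# Hypothetical numbers: 50000 possible words, 35000 tokens of text.
fraction = expected_unseen_fraction(vocab_size=50000, sample_size=35000)
print(f"expected fraction of valid words never seen: {fraction:.2f}")
```

Under these assumptions most of the valid words never show up, so
"absent from the sample" is very weak evidence for "forbidden".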

Voynichese has a Zipf-like word frequency distribution, so it too
suffers from that problem. There are fairly simple word models (such
as my word grammar,
http://www.ic.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/txt.n.html)
that fit 95% of the VMS words. By relaxing them a bit, or adding
exception lists, we could easily cover 100% of the words. (I don't
think it is worth trying, because we know that some 5-10% of the
tokens in our transcriptions are wrong anyway.) But still we cannot
say that a word that doesn't fit such a model is invalid --- it could
just be rare, or even common in general but rare in that context (like
"brothers" in WotW).

If a Martian used the same methods to build a model for English words,
based only on the WotW text, he might well get one that excludes
"brothers" but accepts "foots". Indeed, from that sample he will
probably conclude that an English word cannot contain more than four
"e"s, nor the string "shh". But of course these patterns are not
invalid, only very rare. Now consider the rule "no word has more than
one gallows letter": IIRC it holds for over 95% of the tokens in the
VMS. Are the exceptions transcription errors, or rare but valid words?
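
Checking such a rule against a transcription is easy; interpreting the
exceptions is the hard part. A minimal sketch, assuming an EVA-style
transcription in which the gallows are written "k", "t", "p", "f", and
using a tiny token list made up for illustration (the last token is
invented so that it violates the rule):

```python
GALLOWS = set("ktpf")  # assumption: EVA codes for the gallows glyphs


def gallows_count(token):
    """Number of gallows glyphs in one transcribed token."""
    return sum(1 for glyph in token if glyph in GALLOWS)


def split_by_rule(tokens, limit=1):
    """Separate tokens that obey 'at most `limit` gallows' from the rest."""
    ok = [t for t in tokens if gallows_count(t) <= limit]
    exceptions = [t for t in tokens if gallows_count(t) > limit]
    return ok, exceptions


# Hypothetical sample, not taken from any transcription file.
sample = ["daiin", "qokeedy", "chol", "okaiin", "qotkeedy"]
ok, exceptions = split_by_rule(sample)
coverage = len(ok) / len(sample)
print(f"rule holds for {coverage:.0%} of tokens; exceptions: {exceptions}")
```

The program can list the exceptions, but it cannot tell which of them
are scribal or transcription errors and which are rare valid words.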

Back to rcheodalor: That word does fit my grammar, and by using
mostly common rules:
   
  0.96296 Word -> NormalWord
  0.74680   NormalWord -> CrustPrefix.MantleCore.CrustSuffix
  0.81803     CrustPrefix -> CrP
  0.10177       CrP -> OR
  0.45312         OR -> R
  0.20410           R -> "r"
  0.33279     MantleCore -> WholeMantle
  0.98109       WholeMantle -> MtS
  0.35207         MtS -> OCH.OE
  0.95997           OCH -> CH
  0.70158             CH -> "ch"
  0.98506           OE -> "e"
  1.00000     CrustSuffix -> CrS.OptOFinal
  0.00277       CrS -> OR.OR.OR
  0.54054         OR -> O.R
  0.62448           O -> "o"
  0.40034           R -> "d"
  0.54054         OR -> O.R
  0.36006           O -> "a"
  0.31763           R -> "l"
  0.54054         OR -> O.R
  0.62448           O -> "o"
  0.20410           R -> "r"
  0.36468       OptOFinal -> ""
  
According to my grammar, the only strange thing is
the presence of three "OR" groups in the crust suffix.
Namely, the alternative CrS -> OR.OR.OR is taken by only
70 tokens, out of the 25262 tokens that include a "CrS" part
(that is the 0.00277 in the listing above).
But that is a lot more than the frequency of "feeblenesses"
in English, or "interviewera" in French. And in any case
that is *my* model -- in other models, that word may not be
so strange at all.
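
For those who want to play with the numbers, the derivation above can
be turned into a small script. It reads each number as the relative
frequency of that alternative among the tokens that use the left-hand
symbol (which is how the 70/25262 figure arises), finds the rarest
step, and multiplies everything together. The product assumes that the
choices are independent, which is an assumption of this sketch and not
something the grammar guarantees, so it is at best a crude score and
not a predicted frequency for the word.

```python
import math

# (relative frequency, rule) for each step of the derivation of "rcheodalor".
DERIVATION = [
    (0.96296, 'Word -> NormalWord'),
    (0.74680, 'NormalWord -> CrustPrefix.MantleCore.CrustSuffix'),
    (0.81803, 'CrustPrefix -> CrP'),
    (0.10177, 'CrP -> OR'),
    (0.45312, 'OR -> R'),
    (0.20410, 'R -> "r"'),
    (0.33279, 'MantleCore -> WholeMantle'),
    (0.98109, 'WholeMantle -> MtS'),
    (0.35207, 'MtS -> OCH.OE'),
    (0.95997, 'OCH -> CH'),
    (0.70158, 'CH -> "ch"'),
    (0.98506, 'OE -> "e"'),
    (1.00000, 'CrustSuffix -> CrS.OptOFinal'),
    (0.00277, 'CrS -> OR.OR.OR'),
    (0.54054, 'OR -> O.R'),
    (0.62448, 'O -> "o"'),
    (0.40034, 'R -> "d"'),
    (0.54054, 'OR -> O.R'),
    (0.36006, 'O -> "a"'),
    (0.31763, 'R -> "l"'),
    (0.54054, 'OR -> O.R'),
    (0.62448, 'O -> "o"'),
    (0.20410, 'R -> "r"'),
    (0.36468, 'OptOFinal -> ""'),
]

# The single least common choice in the derivation.
rarest_freq, rarest_rule = min(DERIVATION)

# Crude overall score, valid only if the choices were independent.
score = math.prod(freq for freq, _ in DERIVATION)

# Cross-check of the CrS figure quoted in the text.
crs_fraction = 70 / 25262

print("rarest step:", rarest_rule, rarest_freq)
print("CrS -> OR.OR.OR from counts:", round(crs_fraction, 5))
print("product of all steps:", score)
```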

All the best,

--stolfi