Re: VMs: Overfitting the Data (WAS: Another method different from Cardano Grilles)
Hi everyone,
Regardless of whether you assume that the fundamental letter-like
components of Voynichese are (a) letters/graphs encoded in a single
alphabet, or (b) fragments in a set of complex tables (Gordon Rugg), or (c)
fragments in a set of interlocking circles (Francis A. Violat Bordonau)
[etc], you are effectively reducing the stream to a highly constrained set
of base states - but then the key question becomes, how was the producer
able to choose between those base states?
While it is true that (for example) three rings of EVA digraphs/trigraphs
would likely generate a dictionary containing many familiar words (like EVA
"ot-ol-al"), why do the instance counts of the words in the VMs' actual
dictionary follow such a non-flat distribution? And what could account for
the observed distribution of word-lengths?
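For concreteness, here's the kind of quick tabulation I have in mind, in
Python (the filename below is just a placeholder for whichever EVA
transcription/word-list you happen to use):

    # Count word instances and word lengths in a whitespace-separated EVA
    # word list ("voynich_eva.txt" is a placeholder filename).
    from collections import Counter

    with open("voynich_eva.txt") as f:
        words = f.read().split()

    # Instance counts per dictionary word: a gadget choosing its fragments
    # uniformly at random ought to produce something much flatter than this.
    freq = Counter(words)
    print("Top 10 words:", freq.most_common(10))

    # Word-length distribution, for comparison with what the gadget emits.
    lengths = Counter(len(w) for w in words)
    for n in sorted(lengths):
        print(n, lengths[n])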
To my eyes, what Rugg & Violat Bordonau are doing is building simple
information gadgets to emulate Markov models of the VMs: perhaps this (with
a few tweaks) is the kind of technique Gordon relies on (at least in part)
to build his smart VMs-like tables. In the past, I (and no doubt countless
others) have built explicit state transition models to try to understand
this, based on the probabilities of letters following other letters; others
have probably done much the same (implicitly) with Hidden Markov Models,
etc - but the trick here doubtless lies in identifying the correct set of
states. :-o
http://www.nickpelling.pwp.blueyonder.co.uk/voynich/fourcolumns.gif
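In code, the explicit version of that takes only a few lines - e.g. a
first-order letter-transition table built from EVA words (an illustration
of the general idea only, not my four-column model itself):

    # Build P(next letter | current letter) from a list of EVA words,
    # with "#" standing in for the word boundary.
    from collections import defaultdict

    def transition_model(words):
        counts = defaultdict(lambda: defaultdict(int))
        for w in words:
            token = "#" + w + "#"
            for a, b in zip(token, token[1:]):
                counts[a][b] += 1
        # Normalise the counts into conditional probabilities.
        model = {}
        for a, nxt in counts.items():
            total = sum(nxt.values())
            model[a] = {b: c / total for b, c in nxt.items()}
        return model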
Yet even so, this isn't really very helpful, as it too fails to produce a
flat distribution of probabilities at each decision point - and so falls
squarely into the same kind of problem mentioned above (i.e., "nice set of
states, but how do you drive it?").
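One way to make "flat at each decision point" measurable: compute the
branching entropy of each state in a model like the transition_model
sketch above. If every state came out near its maximum entropy, dice or
tables alone could plausibly drive the choices; the further below maximum,
the more "driving" is left to explain:

    # Per-state branching entropy vs the maximum possible for that state
    # (log2 of the number of available branches).
    import math

    def branch_flatness(model):
        for state, branches in model.items():
            h = -sum(p * math.log2(p) for p in branches.values())
            h_max = math.log2(len(branches)) if len(branches) > 1 else 0.0
            print("%s  entropy=%.2f  max=%.2f" % (state, h, h_max))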
What I would find particularly interesting is if someone were independently
to formalise a set of models (such as my four columns, or crust/mantle/core,
etc) and evaluate (1) the degree to which they "generate" actual Voynichese
words (across all pages), and (2) the probability of each decision point's
branch, under both memoried (i.e. context-dependent) and memoryless
branching. Not so much a science fair project as a good CompSci term paper,
I guess. :-) Such an approach should be able to assess (for example) whether
the hidden states (the red boxes on my model) are a help or a hindrance,
and (perhaps by trying lots of similar models) suggest how to improve on it.
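For part (1), something like the following would do as a yardstick: treat
whatever gadget is being tested as a set of states plus the transitions it
allows (for simplicity this sketch identifies each state with the letter it
emits - the dict you pass in is whatever the model under test defines, not
a claim about any particular one). Part (2) then amounts to estimating each
branch's probability with and without conditioning on the preceding state,
and seeing which fits the text better.

    # transitions: dict mapping each state to the set of states allowed
    # to follow it, with "#" as the start/end-of-word state.
    def accepts(word, transitions, start="#", end="#"):
        state = start
        for ch in word:
            if ch not in transitions.get(state, ()):
                return False
            state = ch
        return end in transitions.get(state, ())

    # Fraction of real VMs word types the proposed gadget can generate.
    def coverage(word_types, transitions):
        hits = sum(1 for w in word_types if accepts(w, transitions))
        return hits / len(word_types)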
IMHO, using Markov models in this kind of way is a far more honest way of
researching this problem, in that it doesn't set out from the shaky ground
of presumed historical answers. Furthermore, if it turns out that we are
able to infer the internal structure in this way, we'll then almost
certainly be able to work out the method of production as a result - but
not as an input.
Finally, a note of caution: the more complex you make your underlying
system (like Rugg's models of ever-increasing, errrrm, "Byzantinity"), the
more its "predictive residue" diminishes - shrinking just as quickly as the
quality of the conceptual match improves - until, asymptotically, the
system ends up containing the very data it is trying to mimic. Sound
familiar? D'oh! :-o
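That caution can even be made measurable (a sketch of the general idea,
not a claim about any particular reconstruction): fit letter models of
increasing Markov order on half the transcription and score them on the
other half. Training fit keeps improving with order, but the held-out
score eventually turns around - right where the model starts memorising
the data it was supposed to explain:

    # Fit an order-N letter model on `train` words and score `test` words,
    # with add-alpha smoothing and "#" padding word boundaries.
    from collections import defaultdict
    import math

    def ngram_logprob(train, test, order, alpha=0.5):
        counts = defaultdict(lambda: defaultdict(int))
        for w in train:
            t = "#" * order + w + "#"
            for i in range(order, len(t)):
                counts[t[i - order:i]][t[i]] += 1
        alphabet = {c for w in train + test for c in w} | {"#"}
        total, n = 0.0, 0
        for w in test:
            t = "#" * order + w + "#"
            for i in range(order, len(t)):
                c = counts[t[i - order:i]]
                denom = sum(c.values()) + alpha * len(alphabet)
                total += math.log2((c[t[i]] + alpha) / denom)
                n += 1
        # Average log2-probability per letter; higher is better.
        return total / n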
Cheers, .....Nick Pelling.....