
Re: VMs: Overfitting the Data (WAS: Another method different from Cardano Grilles)



Hi everyone,

Regardless of whether you assume that the fundamental letter-like components of Voynichese are (a) letters/graphs encoded in a single alphabet, or (b) fragments in a set of complex tables (Gordon Rugg), or (c) fragments in a set of interlocking circles (Francis A. Violat Bordonau), etc, you are effectively reducing the stream to a highly constrained set of base states - but then the key question becomes: how did the producer choose between those base states?

While it is true that (for example) three rings of EVA digraphs/trigraphs would likely generate a dictionary containing many familiar words (like EVA "ot-ol-al"), why do the instance counts over the VMs' actual dictionary have such a non-flat distribution? And what could account for the observed distribution of word-lengths?
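
For concreteness, here is a minimal Python sketch of the two measurements I mean - word-frequency counts and word-length counts - assuming a plain-text file of space-separated EVA words (the filename "voynich_eva.txt" is purely my own placeholder, not a specific published resource):

# Sketch: tabulate word frequencies and word lengths from an EVA transcription.
# Assumes "voynich_eva.txt" is a plain-text file of space-separated EVA words;
# the filename and format are placeholders.
from collections import Counter

with open("voynich_eva.txt") as f:
    words = f.read().split()

freq = Counter(words)                       # instance count per dictionary word
lengths = Counter(len(w) for w in words)    # distribution of word lengths

print("distinct words:", len(freq))
print("top 10 words:", freq.most_common(10))
for n in sorted(lengths):
    print(f"length {n}: {lengths[n]} tokens")

If the generating mechanism were choosing freely between its base states, you would expect both of these distributions to come out far flatter than they actually do.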

To my eyes, what Rugg & Violat Bordonau are doing is building simple information gadgets to emulate simple Markov models of the VMs: perhaps this (with a few tweaks) is the kind of technique Gordon relies on (at least in part) to build his smart VMs-like tables. In the past, I (and no doubt countless others) have built explicit state transition models to try to understand this, based on the probabilities of letters following other letters; others have probably done much the same (implicitly) with Hidden Markov Models, etc - but the trick here doubtless lies in identifying the correct set of states. :-o My own four-column state model is here:
http://www.nickpelling.pwp.blueyonder.co.uk/voynich/fourcolumns.gif
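
By "explicit state transition models" I mean something like the following Python sketch, which estimates the probability of each letter following another across a word list (i.e. a first-order Markov model). Treating each single EVA character as a state is only one of many possible choices, and the word list here is illustrative only:

# Sketch: first-order letter-transition probabilities over EVA-style words.
# "#" marks word boundaries; the word list is illustrative, not a transcription.
from collections import defaultdict

def transition_probs(words):
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "#" + w + "#"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    probs = {}
    for state, following in counts.items():
        total = sum(following.values())
        probs[state] = {ch: n / total for ch, n in following.items()}
    return probs

model = transition_probs(["daiin", "chedy", "qokeedy", "ol", "aiin"])
print(model["a"])   # e.g. the probabilities of the letters that follow "a"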


Yet even so, such a model isn't really very helpful on its own, as it too fails to produce a flat distribution of probabilities at each decision point - and so falls squarely into the same kind of problem mentioned above (i.e. "nice set of states, but how do you drive it?").
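
One way to make "how flat is each decision point?" concrete is to measure the entropy of each state's outgoing branch probabilities and compare it with the flat (uniform) case - again only a sketch, with made-up numbers standing in for a real transition table such as the one produced by transition_probs() above:

# Sketch: how far is each state's branching from a flat distribution?
# "probs" maps each state to its outgoing branch probabilities; the numbers
# below are invented purely for illustration.
import math

def branching_report(probs):
    for state, branches in sorted(probs.items()):
        h = -sum(p * math.log2(p) for p in branches.values() if p > 0)
        h_flat = math.log2(len(branches)) if len(branches) > 1 else 0.0
        print(f"state {state!r}: {len(branches)} branches, "
              f"entropy {h:.2f} bits (a flat choice would give {h_flat:.2f})")

branching_report({"a": {"i": 0.7, "l": 0.2, "r": 0.1},
                  "o": {"k": 0.5, "t": 0.5}})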

What I would find particularly interesting is if someone were independently to formalise a set of models (such as my four columns, or crust/mantle/core, etc) and evaluate (1) the degree to which they "generate" actual Voynichese words (across all pages), and (2) the probability of each decision point's branch, under both memoryless branching and branching with memory. Not so much a science fair project as probably a good CompSci term paper, I guess. :-) Such an approach should be able to assess (for example) whether the hidden states (the red boxes on my model) are a help or a hindrance, and (perhaps by trying lots of similar models) suggest ways to improve on it.
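
To sketch what I mean by (1): express a candidate model as a state-transition table, then ask what fraction of the actual word types it can generate at all. The toy table below is a stand-in for a real model (it is not my actual four-column model), and the word list is illustrative. For (2), you would then weight each branch either by overall symbol frequency (memoryless) or by frequency conditional on the preceding state (with memory), and compare how probable the covered words come out under each.

# Sketch: what fraction of observed word types can a finite-state model generate?
# The state table is a toy stand-in for a real model; "END" is the only state
# in which a word may legitimately stop.
TOY_MODEL = {
    "START":   {"q": "GALLOWS", "o": "GALLOWS", "d": "BODY"},
    "GALLOWS": {"o": "BODY", "k": "BODY"},
    "BODY":    {"a": "BODY", "i": "BODY", "e": "BODY", "d": "BODY",
                "y": "END", "n": "END", "l": "END"},
    "END":     {},
}

def generates(model, word):
    state = "START"
    for ch in word:
        branches = model.get(state, {})
        if ch not in branches:
            return False
        state = branches[ch]
    return state == "END"

def coverage(model, words):
    types = set(words)
    return sum(generates(model, w) for w in types) / len(types)

words = ["daiin", "qokeedy", "ol", "chedy", "okal"]   # illustrative only
print("word-type coverage:", coverage(TOY_MODEL, words))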

IMHO, using Markov models in this kind of way is a far more honest way of researching this problem, in that it doesn't set out from the shaky ground of presumed historical answers. Furthermore, if it turns out that we are able to infer the internal structure in this way, we'll then almost certainly be able to work out the method of production as a result - but not as an input.

Finally, a note of caution: the more complex you make your underlying system (like Rugg's models of ever-increasing, errrrm, "Byzantinity"), the more the "predictive residue" diminishes, just as quickly as the quality of the conceptual match improves - until, asymptotically, the model ends up containing the very data it is trying to mimic. Sound familiar? D'oh! :-o
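
This overfitting trade-off can be made concrete with a toy experiment: fit character n-gram models of increasing order on one half of a word list and measure how well each predicts the other half - past a certain point, the more "Byzantine" model merely memorises its training data and predicts the held-out half worse. This is purely schematic (made-up word lists, add-one smoothing, an assumed alphabet size of 26), not an analysis of anyone's actual tables:

# Sketch: model complexity vs held-out predictive power (overfitting).
# Tiny add-one-smoothed character n-gram models; everything here is schematic.
import math
from collections import defaultdict

def train_ngram(words, order):
    counts, contexts = defaultdict(int), defaultdict(int)
    for w in words:
        padded = "#" * order + w + "#"
        for i in range(order, len(padded)):
            ctx, ch = padded[i - order:i], padded[i]
            counts[(ctx, ch)] += 1
            contexts[ctx] += 1
    return counts, contexts

def heldout_bits(model, words, order, alphabet_size=26):
    counts, contexts = model
    total_bits = n_chars = 0
    for w in words:
        padded = "#" * order + w + "#"
        for i in range(order, len(padded)):
            ctx, ch = padded[i - order:i], padded[i]
            p = (counts[(ctx, ch)] + 1) / (contexts[ctx] + alphabet_size)  # add-one smoothing
            total_bits += -math.log2(p)
            n_chars += 1
    return total_bits / n_chars   # average bits per character on held-out words

train = ["daiin", "chedy", "qokeedy", "okal", "aiin"]   # illustrative split
test  = ["dain", "shedy", "qokedy", "otal", "ain"]
for order in (1, 2, 3, 4):
    print(f"order {order}: {heldout_bits(train_ngram(train, order), test, order):.2f} bits/char")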

Cheers, .....Nick Pelling.....

