Hi everyone,

At 14:54 08/02/2004 -0500, Bruce Grant wrote:
Just as you grab a math reference book to look up a certain integral or the value of a function, it would be nice to have a sort of "data book" of raw VM statistics that could be referred to easily.

Examples of the types of statistics that might be included are:
   - letter frequencies by page / by type of page / by language
   - references to all occurrences of repeated words or phrases
   - statistics on occurrences of gallows characters
   - a word frequency list
   - a KWIC index (concordance) of the VM
   - word ending frequencies
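
To make a few of these concrete, here is a minimal sketch of a word frequency list, word-ending frequencies, and a KWIC index. It assumes the transcription is available as a plain string of space-separated words; the sample line and the function names are made up for illustration and are not taken from any actual transcription or existing tool.

```python
from collections import Counter

# Made-up sample line in an EVA-like spelling; NOT from a real transcription.
SAMPLE = "qokeedy daiin chedy qokeedy ol daiin daiin shedy"

def word_frequencies(text):
    """Count how often each space-separated word occurs."""
    return Counter(text.split())

def ending_frequencies(text, n=2):
    """Count the last n characters of every word (shorter words count whole)."""
    return Counter(word[-n:] for word in text.split())

def kwic(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    words = text.split()
    rows = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

print(word_frequencies(SAMPLE).most_common(3))
print(ending_frequencies(SAMPLE).most_common(2))
for row in kwic(SAMPLE, "daiin"):
    print(row)
```

Splitting on spaces already bakes in one opinion about where words begin and end, so a real version would probably need that choice to be a parameter too.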

Then, for example, rather than referring in general to "triple repetitions of words" it would be easy to examine all the actual occurrences to look for some pattern.
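
Listing every repetition could look something like the sketch below, which reports each run of identical adjacent words together with its position. The word list is invented for the example, and `repeated_runs` is a hypothetical helper name rather than part of any existing tool.

```python
def repeated_runs(words, min_run=3):
    """Return (start_index, word, run_length) for runs of identical adjacent words."""
    runs = []
    i = 0
    while i < len(words):
        j = i
        # Extend j while the next word matches the word that started the run.
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        length = j - i + 1
        if length >= min_run:
            runs.append((i, words[i], length))
        i = j + 1
    return runs

# Invented word sequence, for illustration only.
words = "ol daiin daiin daiin chedy chedy qokedy qokedy qokedy qokedy".split()
for start, word, length in repeated_runs(words):
    print(start, word, length)
```

In practice the start index would be mapped back to a page and line reference, so that each occurrence can be looked up in the manuscript.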

One problem with this is that there is still a good deal of uncertainty over what constitutes both a glyph and an encoded token, so it may be better to do this using a real-time (rather than a static) resource. So... we might consider doing much of this "live" using JavaScript. I've already built a (reasonably funky) Voynich transcription analysis page on this basic theme (if you haven't seen it already):

Perhaps I (or someone else) might refine this to emit a load of other statistics, but offer the option (probably via PHP) of saving particular statistical runs onto the server and giving you a URL for that run to share with others (if you put all the settings into the URL itself, as in test.html?transcription=H&pairs=qo.ee.dy.or.ol&quires=all&... etc, you'd probably run out of space). Just a thought. :-o
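
A rough sketch of the "save the run, share a short URL" idea follows, written in Python rather than PHP purely for illustration. The in-memory dictionary stands in for server-side storage, and the `run` parameter, the ten-character ID, and the function names are all assumptions of mine; only the example settings are from the URL above.

```python
import hashlib
import json
from urllib.parse import urlencode

SAVED_RUNS = {}  # hypothetical stand-in for storage on the server

def long_url(params, page="test.html"):
    """Put every setting into the query string (grows with each option)."""
    return page + "?" + urlencode(params)

def save_run(params, page="test.html"):
    """Store the settings under a short ID and return a URL carrying only that ID."""
    canonical = json.dumps(params, sort_keys=True)  # same settings, same ID
    run_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:10]
    SAVED_RUNS[run_id] = params
    return page + "?" + urlencode({"run": run_id})

def load_run(run_id):
    """Look the settings up again from the short ID."""
    return SAVED_RUNS[run_id]

params = {"transcription": "H", "pairs": "qo.ee.dy.or.ol", "quires": "all"}
print(long_url(params))
print(save_run(params))
```

Deriving the ID from the settings themselves means that two people who choose the same options end up with the same URL, though a simple counter would do just as well.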

By the way, to do this it would be useful to choose one (or more) transcription(s) of the VM as a sort of "reference text", which would be included in the databook and used as the basis for all statistics, with the understanding that there are legitimate differences in opinion which would have some effect on the resulting statistics. (Without "putting a stake in the ground" somewhere, however, it is impossible to do more than talk in generalities.)

Bear in mind that few of the transcriptions are complete, and that opinions on individual characters differ widely (especially on hardy perennials like o/a/y & d/m/y etc). Also, many lines in the interlinear relate solely to people's transcriptions of a small group of pages, and so merging between transcriptions might introduce multiple renderings of the same patterns (on different pages etc) - how to make these choices consistent? Overall, not an easy task.

Then, as new theories arise, additional statistics to investigate them could be generated and added to the "data book".

This might point to a more dynamic (and customisable) solution being appropriate: what would the word count be if I changed all occurrences of "oi" into "ai"? etc etc
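
As a minimal sketch of that kind of "what if" query: apply a list of substitutions to the text, then recount. The sample words are made up to show the effect, and `word_counts` is a hypothetical name.

```python
from collections import Counter

def word_counts(text, substitutions=()):
    """Apply (old, new) substitutions in order, then count space-separated words."""
    for old, new in substitutions:
        text = text.replace(old, new)
    return Counter(text.split())

# Invented sample, for illustration only.
sample = "doiin daiin oiin aiin daiin"

before = word_counts(sample)
after = word_counts(sample, [("oi", "ai")])

print(len(before), "distinct words before:", dict(before))
print(len(after), "distinct words after: ", dict(after))
```

Because the substitutions are applied one after another, their order matters as soon as one rule's output can match another rule's input.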

Cheers, .....Nick Pelling.....

