
Re: VMs: Noise or data ?



I believe Philip Neale has looked into this question, and hit problems
with finding a mechanism for one-way encryption which will produce the
linguistic features of Voynichese (especially with regard to final "m").

It was Prescott Currier in the 1970s who first showed that the frequency of certain words and certain letters of Voynichese is dependent on their position in the line: it was he who drew attention to final "m".

What I have said (and I think I was one of the first people to say it) is
that the monotonous internal structure of the words can be simplified if
you assume that they contain unwritten blanks: for instance
qokey
qokeedy
okeedy
opedy

etc can be seen as representing underlying forms

qo_k_e___y
qo_k_E__dy
_o_k_E__dy
_o_p_e__dy

etc. I have suggested various underlying word grammars at different times: not all Voynichese
words fit them but I can claim to account for about 90% of word tokens on these principles.
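
As a rough illustration of the idea of unwritten blanks, here is a minimal Python sketch that pads a word out to a fixed ten-slot template. The contents of the slots are an assumption pieced together from the examples in this message (l/r in slot 3 is guessed from qol, qor, ol, or; slots 7 and 8 are left empty because nothing here says what fills them), so this is not one of the word grammars referred to above. The names SLOTS, tokenise and align are hypothetical.

```python
# Hypothetical ten-slot template, inferred only from the examples in this
# message. E stands for "ee", S for "sh", C for "ch".
SLOTS = ["q", "o", "lr", "ktpf", "SC", "eE", "", "", "d", "y"]


def tokenise(word):
    """Collapse the digraphs ee, sh, ch into single symbols E, S, C."""
    return word.replace("ee", "E").replace("sh", "S").replace("ch", "C")


def align(word):
    """Pad a word with '_' so each symbol sits in its slot.

    Returns None if the word does not fit the template.
    """
    tokens = tokenise(word)
    out = []
    i = 0
    for choices in SLOTS:
        if i < len(tokens) and tokens[i] in choices:
            out.append(tokens[i])  # the slot is filled by the next symbol
            i += 1
        else:
            out.append("_")        # the slot is an unwritten blank
    # Every symbol must have found a slot, otherwise the word does not fit.
    return "".join(out) if i == len(tokens) else None


for w in ["qokey", "qokeedy", "okeedy", "opedy", "shedy"]:
    print(w, align(w))
```

With these assumed slots the four example words come out as the underlying forms listed above, and a word that does not fit the template returns None.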


If that were all there was to it, I think it would be very easy to generate Voynichese stochastically
along the lines Gordon has been trying. You would simply need what is called a regular expression,
a sort of flow chart in which at each point you select one character or none from a set
of choices specific to that point.
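
A minimal sketch of such a generator in Python, under stated assumptions: the slot contents are guessed from the example words in this message, the probability of leaving a slot blank (p_blank) is an arbitrary placeholder, and the names SLOTS, PATTERN and generate_word are hypothetical. It is meant only to show the "one character or none at each point" mechanism, not to reproduce real Voynichese statistics.

```python
import random
import re

# Hypothetical slots: at each point choose one item or nothing.
SLOTS = [
    ["q"],
    ["o"],
    ["l", "r"],
    ["k", "t", "p", "f"],
    ["sh", "ch"],
    ["e", "ee"],
    ["d"],
    ["y"],
]

# The same template written as a regular expression.
PATTERN = re.compile(r"q?o?[lr]?[ktpf]?(?:sh|ch)?(?:ee|e)?d?y?")


def generate_word(rng, p_blank=0.5):
    """Walk the slots in order, filling each with probability 1 - p_blank."""
    while True:
        parts = [rng.choice(choices) for choices in SLOTS
                 if rng.random() >= p_blank]
        if parts:  # discard the empty word and try again
            return "".join(parts)


rng = random.Random(0)  # fixed seed so that runs are repeatable
words = [generate_word(rng) for _ in range(10)]
print(" ".join(words))
```

Every word produced this way matches PATTERN by construction, which is exactly why a generator of this kind cannot by itself reproduce effects that depend on where a word stands in the line or paragraph.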


The trouble is the existence of other constraints on the placing of letters within the word and the Currier results about differential frequencies at different points in the line. It seems to me that some of these
cannot be explained *by regular expressions*. A stochastic explanation may still be possible, but it
would involve a more complicated kind of state space (word grammar, line grammar) which I think we
still have not got. Gordon has suggested various possibilities in off-list communications to me.


If anyone has put together a list of features along these lines, it would
be very interesting to see them, and might help identify fruitful areas
for further research.

The ones known to me are these (I am not claiming a general priority here, many of these have
been known for years):


pktf are sometimes in free variation with q at the beginning of a word, but this is more frequently
the case where the word in question is the first word of a line, and nearly obligatory where it is
the first word in a paragraph. The first word in a paragraph often contains two characters from
this set, far more often than words elsewhere in a paragraph.


y, d, s are sometimes in free variation with q at the beginning of a word, and more frequently so
when it is the first word in a line, but *not* when it is the first word in a paragraph.


forms such as shedy, chedy, shey, chey (which I analyse as ____Se__dy, ____Ce__dy etc) are
most frequently found as the second or third words in a line, seldom as the first word.


ktpf are in free variation with each other after initial qo, qol, qor, o, ol, or. k and t are more frequent
than p and f. Normally, k is more frequent than t, but as Rene Zandbergen pointed out on the list
a year or two ago, there are continuous sections of text where t is more frequent than k.


the sequences ke, te are common but the sequences pe, fe are rare (even allowing for the fact
that p and f are less common than k and t)


final p and f occur very occasionally: where they do, it is usually in the middle of the first line
of a paragraph


the sequences el, er, eel, eer are rare

the sequences an, ain, al, ar, ol, or are common, but on and oin are disproportionately infrequent.

am is in free variation with an, ain, al, ar at the end of a word, but this is more frequently the case
where the word in question is the last or nearly the last in a line.


s is in free variation with y at the end of a word (e.g. oteey, otees, qokey, qokes). This more
commonly occurs after ee than after e: final s is common in some parts of the manuscript (though
never more common than final y) and uncommon in other sections.


isolated words like the star labels seldom begin with q

sequences of three consecutive tokens of the same word (e.g. qokeey qokeey qokeey) occur
more often than you would expect in natural language.


sequences of three consecutive tokens of three different words have fewer repeated occurrences
than you would expect in natural language (e.g. there are no triplets like 'and of the' which occur
together again and again in English text).
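
The last two observations can be counted mechanically on any transcription. The following Python sketch shows one way to do it; the function names are hypothetical and the sample token list is invented purely to exercise the code, not taken from the manuscript. Deciding whether a count is more or less than "you would expect" still requires a comparison text, which this sketch does not supply.

```python
from collections import Counter


def count_triple_repeats(words):
    """Count positions where the same word occurs three times in a row."""
    return sum(1 for a, b, c in zip(words, words[1:], words[2:])
               if a == b == c)


def repeated_distinct_triplets(words):
    """Return triplets of three different words that occur more than once."""
    counts = Counter(
        (a, b, c)
        for a, b, c in zip(words, words[1:], words[2:])
        if len({a, b, c}) == 3  # all three words differ
    )
    return {triplet: n for triplet, n in counts.items() if n > 1}


# Invented toy sequence, not real manuscript text.
sample = ("qokeey qokeey qokeey okeedy shedy chedy "
          "okeedy shedy chedy").split()

print(count_triple_repeats(sample))
print(repeated_distinct_triplets(sample))
```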


Observations like these cannot be explained purely in terms of a regular expression: they involve
what linguists call dependencies which are usually (in connection with natural languages)
analysed using tree structures or bracketed lists. Which brings us back to the phenomenon of
the line and paragraph as a structural unit. It was Currier who noticed this, but neither he nor
anyone since has explained why this should be so.
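
The dependence on position in the line can at least be measured, even if it cannot yet be explained. Below is a minimal Python sketch that tallies how often words at each position satisfy some test, counting either from the start or from the end of the line. The function name rate_by_position is hypothetical, and the three sample lines are invented toy data (okam and otam in particular are made up to stand for words in final m), so the numbers mean nothing beyond showing the method.

```python
from collections import defaultdict


def rate_by_position(lines, predicate, from_end=False):
    """For each word position, return the share of words passing predicate.

    Position 0 is the first word of the line, or the last word if
    from_end is True.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for line in lines:
        words = line.split()
        if from_end:
            words = words[::-1]
        for pos, word in enumerate(words):
            totals[pos] += 1
            if predicate(word):
                hits[pos] += 1
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}


# Invented toy lines, not real manuscript text.
lines = [
    "qokey shedy okeedy okam",
    "okeedy chedy qokeedy opedy",
    "qokeedy shey otam",
]

# Share of words ending in m, counted from the end of the line.
final_m = rate_by_position(lines, lambda w: w.endswith("m"), from_end=True)
print(final_m)
```

The same function can be pointed at other features in the list, for example with a test such as w.startswith(("sh", "ch")) counted from the start of the line.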


Philip Neal
