(Obviously any of these assumptions may be defeated if the text is not
based on a phonological representation of a typical human language.)
I point this out because one of the difficulties of collocational analysis
(such as Mark has done) or syntactic analysis (notably Jorge Stolfi's, of
course) on the VMs is that unless we can guarantee that we can properly
distinguish letters and words, our analyses may involve uncertain or
mingled levels.
For example, in the present context, we can't tell whether a given set of
elements reflects a set of copulas coming between subjects and predicates,
or a set of common prefixes, or even vowels (between consonants).
Hypothetically a collocation might even reflect a combination of
phrase-initial words, prefixes, and word-initial letters, if the VMs is
cleverly enough encoded.
Alternatively, in past discussions of the difficulties with taking the
"letter-space" separated elements (EVA characters) as letters, I've pointed
out that if we don't know whether the EVA letters are the actual
letter-elements, then a grammar of them might mingle canonical form and
morphology.
Still, assuming that we are dealing with a phonetic text and more or less
natural language, then if the VMs words represent something other than
words per se, they would probably still be more or less ordered. For
generality we might want to allow that the perceived EVA characters and
perceived word divisions represent variably more or less than letters or
words, but are nevertheless ordered. Or we might want to allow local
reordering (inverted, halves swapped, Pig-Latined, etc.).
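To make "local reordering" concrete, here is a minimal sketch of the three
transforms just mentioned, applied word by word so that the order of the
words themselves is untouched. The function names are my own, the Pig Latin
rule is just one common variant (leading consonants moved to the end, plus
"ay"), and nothing here is a claim about how the VMs was actually produced.

```python
VOWELS = set("aeiou")  # assumption: plain Latin-alphabet vowels, for illustration only

def invert(word):
    """Reverse the letters of a word."""
    return word[::-1]

def swap_halves(word):
    """Move the second half of a word in front of the first half."""
    mid = len(word) // 2
    return word[mid:] + word[:mid]

def pig_latin(word):
    """Move the leading consonant run to the end and append 'ay'."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowel found: leave the letters in place

def reorder_locally(text, transform):
    """Apply a transform inside each word; word order is preserved."""
    return " ".join(transform(w) for w in text.split())

print(reorder_locally("pig latin text", pig_latin))
print(reorder_locally("abcd efg", swap_halves))
print(reorder_locally("abcd efg", invert))
```

In each case the sequence of words survives even though the letters within
a word are scrambled, which is the sense of "ordered" intended above.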
In any event, if the VMs encodes a text in some language, then one way or
another we need to start by identifying the letter and word units.
Repeated experiments of the sort Mark and others report suggest that we're
somehow off a bit in this respect, but right in assuming an ordered text.
A question that occurs to me is whether all VMs words can be accounted for
in terms of sequences of shorter words. I think someone must have looked
at this.
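As a rough sketch of how one might test this, the function below checks
whether a word can be written as a sequence of two or more shorter words
from the same word list, using a simple dynamic-programming scan. The
function names and the toy word list are made up for illustration (the
strings merely look like EVA transcriptions); a real test would need a full
transcription and some control for chance matches among very short words.

```python
def split_into_shorter_words(word, vocab):
    """Return one split of `word` into shorter vocabulary words, or None."""
    # best[i] is a list of pieces that exactly covers word[:i], or None.
    best = [None] * (len(word) + 1)
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            # The whole word is not allowed to count as its own piece.
            if best[j] is not None and piece != word and piece in vocab:
                best[i] = best[j] + [piece]
                break
    return best[len(word)]

def decomposable_fraction(vocab):
    """Fraction of vocabulary words that split into shorter vocabulary words."""
    hits = sum(1 for w in vocab if split_into_shorter_words(w, vocab))
    return hits / len(vocab)

# Toy word list, for illustration only.
toy_vocab = {"qo", "keedy", "qokeedy", "daiin", "ol", "chedy", "olchedy"}

print(split_into_shorter_words("qokeedy", toy_vocab))
print(split_into_shorter_words("daiin", toy_vocab))
print(decomposable_fraction(toy_vocab))
```

A high fraction on real data would not by itself show that long words are
compounds, since short words match by accident fairly easily.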
It occurs to me that letter-frequency lists don't usually list the word
separator!
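For what it's worth, counting the separator alongside the letters is easy
to do. The sketch below treats the separator as one more symbol and shows
it as "_" in the output. The function name, the choice of a space as
separator, and the sample string are all assumptions for illustration; a
transcription file may well mark word breaks some other way.

```python
from collections import Counter

def symbol_frequencies(text, separator=" ", shown_as="_"):
    """Count every symbol in text, including the word separator."""
    counts = Counter(shown_as if ch == separator else ch for ch in text)
    return counts.most_common()

# Made-up sample string, for illustration only.
for symbol, count in symbol_frequencies("daiin ol chedy ol daiin"):
    print(symbol, count)
```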