RE: VMs: VMS Word context similarities
On Thu, 8 Sep 2005, Marke Fincher wrote:
> Something similar happens with the VMS. When the threshold is set high,
> randomisation reduces the number of relationships found, when the
> threshold is set low, it increases the number.
We expect two or three levels of organization in a text.
- We expect the ordering imposed by the syntax of the language.
- We expect that letters (or really the underlying sounds) are ordered
within a word to reflect the canonical form of words in the language,
e.g., very simply, CVCV, not a random selection of n tokens from the set
of letters.
- We expect an intermediate level of ordering reflecting the morphology of
words, e.g., morphemes recur in different words (e.g., con-form, con-tact,
in-tact, uni-form, etc.) and their occurrence is governed by the
morphosyntax of the language.
(Obviously any of these assumptions may be defeated if the text is not
based on a phonological representation of a typical human language.)
I point this out because one of the difficulties of collocational analysis
(such as Mark has done) or syntactic analysis (notably Jorge Stolfi's, of
course) on the VMs is that unless we can guarantee that we can properly
distinguish letters and words, our analyses may involve uncertain or
mingled levels.
For example, in the present context, we can't tell if a set reflects
perhaps a set of copulas coming between subjects and predicates, or a set
of common prefixes, or even vowels (between consonants). Hypothetically a
collocation might even reflect a combination of phrase initial words,
prefixes, and word-initial letters, if the VMs is cleverly enough encoded.
Alternatively, when discussing the difficulties of taking the "letter-space"
separated elements (EVA characters) as letters, I've pointed out in the
past that if we don't know whether the EVA letters are the actual
letter-elements, then a grammar of them might mingle canonical form and
morphology.
Still, assuming that we are dealing with a phonetic text and more or less
natural language, then if the VMs words represent something other than
words per se, they would probably still be more or less ordered. For
generality we might want to allow that the perceived EVA characters and
perceived word divisions represent variably more or less than letters or
words, but are nevertheless ordered. Or we might want to allow local
reordering (inverted, halves swapped, Pig-Latined, etc.).
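To make the three kinds of local reordering concrete, here is a minimal Python sketch. The function names are my own, the Pig Latin rule is just one simple variant (leading consonants moved to the end, plus "ay"), and the English vowel set is an assumption that says nothing about what counts as a vowel in the VMs.

```python
# Illustrative only: three local reorderings of a single word.
# All names here are hypothetical helpers, not anyone's published method.

VOWELS = set("aeiou")  # assumption: English vowels, purely for the demo


def invert(word):
    """Reverse the order of the symbols in the word."""
    return word[::-1]


def swap_halves(word):
    """Move the second half of the word in front of the first half.

    For odd lengths the first half is the shorter one.
    """
    mid = len(word) // 2
    return word[mid:] + word[:mid]


def pig_latin(word):
    """Move any leading consonants to the end and append 'ay'."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # no vowel found: leave the order alone


if __name__ == "__main__":
    for w in ["string", "contact", "uniform"]:
        print(w, invert(w), swap_halves(w), pig_latin(w))
```

Each of these keeps every symbol inside its own word, so the text stays ordered at the word level even though the within-word order no longer matches the canonical form.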
In any event, if the VMs encodes a text in some language, then one way or
another we need to start by identifying the letter and word units.
Repeated experiments of the sort Mark and others report suggest that we're
somehow off a bit in this respect, but right in assuming the ordered text.
A question that occurs to me is whether all VMs words can be accounted for
in terms of sequences of shorter words. I think someone must have looked
at this.
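One way the question could be checked, as a rough sketch: treat the set of distinct words as a vocabulary and test each word for a split into two or more shorter vocabulary words. The function names, the min_len parameter, and the toy vocabulary below are my own assumptions; a real run would need a transcription and a decision about which word breaks to trust.

```python
# Illustrative sketch: can a word be written as a sequence of shorter
# words from the same vocabulary? Standard dynamic-programming check.


def decomposable(word, vocab, min_len=1):
    """True if word is a concatenation of two or more vocab words."""
    n = len(word)
    # reachable[i] is True when word[:i] splits into allowed pieces
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if (
                reachable[j]
                and piece != word          # the whole word is not a piece
                and len(piece) >= min_len
                and piece in vocab
            ):
                reachable[i] = True
                break
    return reachable[n]


def fraction_decomposable(vocab, min_len=1):
    """Share of distinct words that split into shorter vocab words."""
    vocab = set(vocab)
    if not vocab:
        return 0.0
    hits = sum(decomposable(w, vocab, min_len) for w in vocab)
    return hits / len(vocab)


if __name__ == "__main__":
    toy = {"ab", "cd", "abc", "abcd"}  # made-up tokens, not VMs data
    print(sorted(w for w in toy if decomposable(w, toy)))
    print(fraction_decomposable(toy))
```

Raising min_len matters: if single characters occur as words, almost anything decomposes trivially, so the interesting figure is how the fraction changes as very short pieces are excluded, and how it compares with the same count on a shuffled text.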
It occurs to me that letter-frequency lists don't usually list the word
separator!
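As a small sketch of what counting the separator would look like: the snippet below treats the word break as one more symbol in the stream. The function name and the choice of "." as the separator mark are assumptions for illustration, and the sample text is made up; it also assumes the separator mark does not otherwise occur in the text.

```python
# Illustrative sketch: symbol frequencies with the word break counted
# as a symbol in its own right.
from collections import Counter


def symbol_frequencies(text, separator="."):
    """Relative frequency of each symbol, word separator included.

    Words are taken to be whitespace-delimited; each break between two
    words is counted once under `separator`.
    """
    words = text.split()
    stream = separator.join(words)
    counts = Counter(stream)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sym: n / total for sym, n in counts.most_common()}


if __name__ == "__main__":
    sample = "ab ab a"  # made-up text, not a transcription
    for sym, freq in symbol_frequencies(sample).items():
        print(repr(sym), round(freq, 3))
```

With short average word lengths the separator comes out as one of the most frequent symbols, which is the reason it seems worth listing.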