
Re: AW: VMs: Character repetition



On Sat, 18 Sep 2004, Gabriel Landini wrote:
> Long: There are several features in those correlation plots. I mostly
> concentrated in the short-length correlations. What I am saying :-) is
> that the analysis still recognises fluctuations in symbol occurrences
> that peak at the same length as the tokens (i.e. their mode) in various
> languages. Of course this could be a coincidence.

Note that I only meant "coincidence" in the context of the VMs.  In
Chaucer, or whatever, you know what the words are, even with the spaces
removed, and we're all reasonably sure that Chaucer didn't cleverly
include some parallel message encoded in units of modal length n equal to
the modal length of the words in the text and cleverly coincident with the
English text.

But in the case of the VMs, if we don't appeal to other evidence like
Stolfi's "morphosyntaxes" - real enough evidence of course - then we can't
be sure that the observed spacing isn't conceivably inserted at the same
modal lengths as the "read token" but in different places.  In this sense,
of course, Stolfi's approach and yours tend to reinforce each other, by
providing different kinds of evidence that the token string observed is
structured independently of the spacing.

I think your two kinds of evidence are different, but not independent, in
that Stolfi is deducing rules for generating the patterns within tokens,
while you are measuring the periodicities that result from the existence
of the patterns.

The independent evidence of the existence of the tokens as significant
units is, of course, the inter-token spacing itself.

> The only way I can see to test this is to create surrogate data, and
> that is precisely what I did: character and token scrambling. The first
> one destroys the effect, the second doesn't. I therefore suggested that
> this peak has to do with word construction+relative frequencies of words
> and not with sentence construction (i.e. the position of the tokens does
> not seem to affect it).

Here's an experiment.  Suppose one created a set of arbitrary rules for
generating character sequences, e.g., something as simple as all words
consisting of ba*c?, with some known limits on length, or something more
complex, on the order of what Stolfi has deduced, and used it to generate
an arbitrary sequence of tokens.  I hypothesize that, analyzed with the
spectral technique, such a string of tokens would produce results similar
to those for the VMs and Chaucer, modulo particular modes.  And I
hypothesize that the results would also be independent of token order, but
not of character order.  This would be analogous to generating numbers
from periodic functions, adding some random noise, and analyzing the
result.  If we shuffled the random numbers the periods would still be
there.
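
A minimal sketch of that experiment, under stated assumptions: tokens are
drawn from a bounded subset of ba*c? (one to four a's, an optional c, both
limits chosen arbitrarily here), and a plain lag-match rate stands in for
the spectral technique, whose details are not given in this thread.  The
function names are hypothetical.

```python
import random

def make_tokens(count, rng):
    """Generate tokens of the form b a{1,4} c?, a bounded subset of ba*c?."""
    return ["b" + "a" * rng.randint(1, 4) + ("c" if rng.random() < 0.5 else "")
            for _ in range(count)]

def match_rate(text, lag):
    """Fraction of positions i where text[i] == text[i + lag]."""
    pairs = len(text) - lag
    return sum(text[i] == text[i + lag] for i in range(pairs)) / pairs

def surrogates(tokens, rng):
    """Return original, token-shuffled and character-shuffled strings (no spaces)."""
    original = "".join(tokens)
    shuffled_tokens = tokens[:]
    rng.shuffle(shuffled_tokens)      # keeps every token intact, changes order
    chars = list(original)
    rng.shuffle(chars)                # destroys the within-token structure
    return original, "".join(shuffled_tokens), "".join(chars)

rng = random.Random(0)                # fixed seed so runs are repeatable
tokens = make_tokens(2000, rng)
original, token_shuffled, char_shuffled = surrogates(tokens, rng)

# One row per lag: original, token-shuffled, character-shuffled match rates.
for lag in range(1, 9):
    rates = [match_rate(t, lag) for t in (original, token_shuffled, char_shuffled)]
    print(lag, *(round(r, 3) for r in rates))
```

If the hypothesis holds, the first two columns should track each other
across lags while the third flattens out.  This says nothing about whether
the tokens carry a message; it only checks that the patterning survives
token shuffling and not character shuffling.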

If this is true, and I think it would prove to be, what this tells us - or
rather, I think, reminds us - is that the spectral technique can recognize
patterning in the form of tokens, but that it can't guarantee that this
patterning has any functional load.  Of course, any set of patterns can
carry a functional load - two different signals and you can encode a
message if you devise rules and line up enough of them - so this shouldn't
lead to any sort of mental crisis among those who dislike considering the
possibility of the VMs being "empty text."

What we do know from the spectral analysis and the syntactic analysis is
that there is some formal (in the sense of formulaic) reality to the
tokens delimited by the inter-token spaces, and that they form a suitable
basis for encoding something.  They aren't noise, and we have some insight
into their structure.  So, those who like can continue to accept the
decode-it-if-you-can challenge the VMs implies.  I personally prefer this
...

There is, I think, one caveat.  We can see that the constituent marks or
graphs and (at a somewhat higher level) glyphs (letter-like associated
sets of marks) are patterned.  The syntactic and spectral analyses and the
inter-token spacing identify patterns at a still higher level - the token
level.  But languages use variations in token form and/or token sequence
to convey messages.  Those who are interested in approaching the problem
of proving or disproving the existence of a message - a functional load -
in the VMs, might still be able to make something of patterning or lack of
it at this level.

Note that I think that Rugg hasn't managed that.  He has pointed to the
repetition of words, but most of his actual work has been at the level of
finding a plausible and/or easy way to generate suitably patterned tokens.
I think what the recent consideration of token structure on this list has
convinced me of is that patterning in the form of tokens is potentially
independent of the presence of a message, though necessary for it.  In
effect he's showing that a message is possible, not at all that it's
impossible.  Actually, all he's shown is that it is possible to produce a
degree of token-level patterning randomly.  But random application of any
set of rules for generating the morphological (or phonological) forms of
words for any language will do as much.  He's shown that he can generate
tokens - word like objects - not that these aren't being used to encode
something.  To do that he's going to have to wrestle with the attested set
of tokens and their sequencing, not processes for generating things that
look like them.

I think that others - my apologies that I've lost track of whom! - have
been saying this - more or less - but it took me a while to work it out
for myself.

> [This] could be seen as another bit of hammering against Strong's
> "solution" because those features exist in the vms and become
> increasingly unlikely in polyalphabetic substitutions.

This refers to polyalphabetic substitutions that blur the frequencies of
letter occurrence?  I'm thinking that something like Riddersta's approach
might be a partial exception, because the alphabet identifier is included,
as it were, as a higher order digit.  For example, if one distributed the
table-selectors and row-selectors across the indices instead of factoring
them out, one would have what amounted to a single alphabet with 8
variants of each letter.  I suppose what makes a polyalphabetic
substitution harder to crack is that one has to deduce these higher order
digits' existence and identity.  If the coder elects to include them more
or less explicitly, the effect is lost?
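
A toy illustration of that last point, and only an assumption about the
general idea: this is not a reconstruction of anyone's actual scheme.  Eight
hypothetical substitution tables are used, and each cipher symbol is written
as selector * 26 + index, so the table selector travels with the symbol as a
higher-order digit.

```python
import random
import string

ALPHABET = string.ascii_lowercase
VARIANTS = 8  # "8 variants of each letter", as in the paragraph above

def make_tables(rng):
    """Eight hypothetical substitution alphabets, each a permutation of 0..25."""
    tables = []
    for _ in range(VARIANTS):
        perm = list(range(26))
        rng.shuffle(perm)
        tables.append(perm)
    return tables

def encode(plain, tables, rng):
    """Emit selector * 26 + substituted index for each plaintext letter."""
    out = []
    for ch in plain:
        sel = rng.randrange(VARIANTS)          # which alphabet to use
        out.append(sel * 26 + tables[sel][ALPHABET.index(ch)])
    return out

def decode(symbols, tables):
    """Recover the selector from the symbol itself, then invert that table."""
    text = []
    for sym in symbols:
        sel, idx = divmod(sym, 26)
        text.append(ALPHABET[tables[sel].index(idx)])
    return "".join(text)

rng = random.Random(1)
tables = make_tables(rng)
message = "voynichmanuscript"
cipher = encode(message, tables, rng)
print(cipher)
print(decode(cipher, tables))
```

Because every symbol names its own table, each of the 208 possible symbols
stands for exactly one plaintext letter.  The result behaves like a single
alphabet with eight variants per letter, which is the sense in which the
polyalphabetic effect would be lost.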

