
Re: AW: VMs: Character repetition



On Saturday 18 September 2004 22:28, Koontz John E wrote:
> Note that I only meant "coincidence" in the context of the VMs.  In
> Chaucer, or whatever, you know what the words are, even with the spaces
> removed, and we're all reasonably sure that Chaucer didn't cleverly
> include some parallel message encoded in units of modal length n equal to
> the modal length of the words in the text and cleverly coincident with the
> English text.

Possible, but against this is the fact (as posted by Stolfi recently) that 
vms words show statistics similar to those of the vms labels, such as 
length (5 vs 6, keeping in mind the difference in sample size). 
That is why I mentioned getting the modal token length from 
the spaces or from the line-length distribution.
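In case it helps, this is all I mean by getting the modal token length from 
the spaces. A minimal Python sketch (voynich_eva.txt is just a placeholder 
name for whatever space-delimited transcription one uses):

from collections import Counter

def modal_token_length(path):
    # path points at a plain-text transcription with spaces as word
    # separators (the filename below is only a placeholder)
    lengths = Counter()
    with open(path, encoding="ascii") as f:
        for line in f:
            for token in line.split():
                lengths[len(token)] += 1
    # the mode of the token-length distribution, plus the full histogram
    return lengths.most_common(1)[0][0], lengths

mode, dist = modal_token_length("voynich_eva.txt")
print("modal token length:", mode)
for length in sorted(dist):
    print(length, dist[length])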

I also suspect that if another character were the token separator, and we are 
to believe that the peak in the vms is due to the same reason as in other 
languages, then the frequency of that separator should be similar to 
that of the space.
There are about 39147 spaces. The nearest character in frequency terms is <o> 
with 25522, then <e> with 20551.
If that were the case, then the correlation peak would not coincide
with the modal token length (not that this is necessary, of course), 
and I also wonder whether the segmented text would follow Zipf's law.
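Ranking the candidate separators by frequency is a small job, something like 
this (same placeholder transcription file as above):

from collections import Counter

with open("voynich_eva.txt", encoding="ascii") as f:
    text = f.read()

# count every character except newlines, then compare against the space
counts = Counter(c for c in text if c != "\n")
spaces = counts.pop(" ", 0)
print("spaces:", spaces)
for ch, n in counts.most_common(5):
    print(ch, n)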
To see how plausible <o> is as a token separator, here is the beginning of f1r:

<f1r.1>        fachysykalarataiinsh.lsh.rycthresyk.rsh.ldy
<f1r.2>        s.ryckhar.rykairchtaiinshararectharcthardan
<f1r.3>        syaiirsheky.rykaiinsh.dcth.arycthesdaraiinsa
<f1r.4>        .'.iin.teey.te.sr.l.tycth*ardaiin.taiin.r.kan
<f1r.5>        sairychearcthaiincpharcfhaiinydaraishy

where <.> stands for <o> and the original spaces have been removed. This 
generates extremely long tokens and also introduces the problem of the 
line-initial <o>. Quite a few labels also start with <o>; if <o> were a 
separator, that leading character would serve no purpose.
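The same check can be run over the whole corpus rather than a few lines: 
strip the spaces, split on the candidate separator, and look at the token 
lengths and the rank-frequency curve. A quick sketch (placeholder filename 
again; nothing more than a sanity check is intended):

from collections import Counter

SEP = "o"   # candidate separator under test

with open("voynich_eva.txt", encoding="ascii") as f:
    stream = f.read().replace(" ", "").replace("\n", "")

tokens = [t for t in stream.split(SEP) if t]
lengths = Counter(len(t) for t in tokens)

print("modal token length:", lengths.most_common(1)[0][0])
print("mean token length:", sum(map(len, tokens)) / len(tokens))

# crude Zipf check: rank vs frequency (a Zipfian text gives a roughly
# straight line of slope about -1 on a log-log plot)
freqs = Counter(tokens)
for rank, (tok, n) in enumerate(freqs.most_common(10), start=1):
    print(rank, tok, n)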

> But in the case of the VMs, if we don't appeal to other evidence like
> Stolfi's "morphosyntaxes" - real enough evidence of course - then we can't
> be sure that the observed spacing isn't conceivably inserted at the same
> modal lengths as the "real token" but in different places.

Of course we cannot dismiss that possibility. But we can ask ourselves 
whether that is something that could have been planned so accurately. I 
really doubt it.

> Here's an experiment.  Suppose one created a set of arbitrary rules for
> generating character sequences, e.g., something as simple as all words
> consisting of ba*c?, with some known limits on length, or something more
> complex, on the order of what Stolfi has deduced, and used it to
> generate an arbitrary sequence of tokens.  I hypothesize that analyzed
> with the spectral technique such a string of tokens would produce similar
> results to the VMs and Chaucer, modulo particular modes.  And I
> hypothesize that the results would also be independent of token order, but
> not of character order.  This would be analogous to generating numbers
> from periodic functions, adding some random noise, and analyzing it.  If we
> shuffled the random numbers the periods would still be there.

Yes it would, but I do not see how that would make the results any different. 
As mentioned earlier, this is due (I think) to the word-construction rules and 
their frequencies. Whether there is any meaning is a completely different 
matter (I think there is large-scale structure that shows up as long-range 
correlations; this is dealt with below).
I posted some time ago a way (using a dictionary) to generate vms-like text 
that is fully meaningful and has the same entropy as the vms, plus Zipf's 
law. Again, the spectral analysis of this text would reveal the same results 
as the vms, and I doubt that anybody would be able to crack it. (I made an 
offer to send a text encoded this way and nobody has taken up the challenge 
yet, even with the advantage that I can assure that this one is meaningful.)
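To make the point about token order concrete, here is a toy version of the 
experiment you describe: tokens of the form ba*c? with a length limit, 
looked at with a plain character autocorrelation. This is only an 
illustration of the idea, not the spectral analysis from the paper; the 
non-zero values at small lags should survive token shuffling but vanish 
under character shuffling.

import random

random.seed(1)

def make_token():
    # the ba*c? pattern with a length limit: "b", 0-4 "a"s, optional "c"
    return "b" + "a" * random.randint(0, 4) + random.choice(["", "c"])

def autocorr(text, max_lag=15):
    # autocorrelation of a numeric encoding of the character stream
    codes = [ord(c) for c in text]
    n = len(codes)
    m = sum(codes) / n
    var = sum((x - m) ** 2 for x in codes) / n
    acf = []
    for lag in range(1, max_lag + 1):
        cov = sum((codes[i] - m) * (codes[i + lag] - m)
                  for i in range(n - lag)) / (n - lag)
        acf.append(cov / var)
    return acf

tokens = [make_token() for _ in range(5000)]
original = ".".join(tokens)            # "." plays the role of the space

token_shuffled = tokens[:]
random.shuffle(token_shuffled)

chars = list(original)
random.shuffle(chars)

samples = {
    "original":       original,
    "token shuffled": ".".join(token_shuffled),
    "char shuffled":  "".join(chars),
}

for name, text in samples.items():
    print(name, [round(a, 2) for a in autocorr(text)])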

Funnily enough, according to Rugg this would be a meaningless text (it shares 
a large number of statistical properties with the vms), but I know it isn't 
gibberish, so his conclusion is incorrect.

> If this is true, and I think it would prove to be, what this tells us - or
> rather, I think, reminds us - is that the spectral technique can recognize
> patterning in the form of tokens, but that it can't guarantee that this
> patterning has any functional load.

What you wrote above is correct in terms of the "modal-token" peak, but 
there is more in the correlation plots. The long-range correlations tell 
you whether there is any large-scale structure in the stream. 
Figure 8a in my Cryptologia paper shows that these long-range correlations
(the negative slope of the left part of Fig. 8a) disappear with token 
(and character) scrambling. So my conclusion was that there is some 
structural part of the text which is destroyed by token scrambling. 
If the vms text were a random collection of tokens, these long-range 
correlations would not exist in the first place.
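For anyone who wants to see the effect without redoing the spectra, a crude 
stand-in is to compare the character autocorrelation at a few long lags for 
the original stream and for a token-scrambled version of it (placeholder 
transcription file again; this only approximates what the power spectrum in 
the paper measures):

import random
from statistics import mean

def autocorr_at(codes, lag):
    # normalised autocorrelation of the numeric character stream at one lag
    m = mean(codes)
    var = sum((x - m) ** 2 for x in codes) / len(codes)
    n = len(codes) - lag
    cov = sum((codes[i] - m) * (codes[i + lag] - m) for i in range(n)) / n
    return cov / var

with open("voynich_eva.txt", encoding="ascii") as f:
    tokens = f.read().split()

random.seed(1)
scrambled = tokens[:]
random.shuffle(scrambled)

for name, toks in (("original", tokens), ("token scrambled", scrambled)):
    codes = [ord(c) for c in " ".join(toks)]
    for lag in (10, 100, 1000, 5000):
        print(name, "lag", lag, round(autocorr_at(codes, lag), 4))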

You may ask "what on Earth do we mean by 'structure'?". I do not 
think that there is a consensus, other than there is some non-random 
large scale (1000-10000) ordering of the characters. Some suggestions for 
what may be governing this 'structure' are: sentences, paragraph, chapter & 
subject matter, but I am sure that there may be others, like human bias to 
avoid the same word in consecutive sentences, or clustering of certain words 
in some parts of the corpus and so on.

> Note that I think that Rugg hasn't managed that.  He has pointed to the
> repetition of words, but most of his actual work has been at the level of
> finding a plausible and/or easy way to generate suitably patterned tokens.

Exactly! He is only doing reverse-engineering at the token level. 
If he wants to generate large-scale structure, I am sure there are some 
convoluted ways to do it, but that is in effect moving the goal posts all the 
time. The more requirements, the less believable the whole thing becomes. 
Interestingly, he has not produced any long chunks of text, so there is no way 
to test for large-scale structure either. But even if we tested it and there 
were no large-scale structure, I fear that his argument is likely to go in the 
same direction as his attempt to explain the labels and weirdoes: you get 
another special table for that. (!)
To me it seems that the claims are not supported by any data (at least in his 
paper there isn't any quantitative result), not even at the token level. It 
is pretty much qualitative stuff.

> He's shown that he can generate
> tokens - word like objects - not that these aren't being used to encode
> something.

Sure, and as Jacques put it very nicely in his recent review, being able to
generate Chinese-like gibberish does not mean that Chinese literature is
gibberish.

> This refers to polyalphabetic substitutions that blur the frequencies of
> letter occurrence?

Yes.
Sorry for the long post.

Regards,

Gabriel

______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list