Re: AW: VMs: Character repetition
On Saturday 18 September 2004 22:28, Koontz John E wrote:
> Note that I only meant "coincidence" in the context of the VMs. In
> Chaucer, or whatever, you know what the words are, even with the spaces
> removed, and we're all reasonably sure that Chaucer didn't cleverly
> include some parallel message encoded in units of modal length n equal to
> the modal length of the words in the text and cleverly coincident with the
> English text.
Possible, but weighing against this is the fact (as Stolfi posted recently) that
vms-words show statistics similar to those of the vms labels, such as length
(5 vs 6, keeping in mind the difference in sample size).
That is why I mentioned getting the modal token length from
the spaces or from the line-length distribution.
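As a minimal sketch of that first step (a toy EVA-like line, not the real transcription), the modal token length falls straight out of the space positions:

```python
from collections import Counter

# Toy stand-in for a transcribed line stream; the real exercise would
# use the full EVA transcription. Token boundaries come from the spaces.
text = "daiin chedy qokeedy shedy daiin okaiin chedy qokain daiin shedy"

lengths = Counter(len(tok) for tok in text.split())
modal_length = lengths.most_common(1)[0][0]
print(modal_length)  # -> 5 for this toy line
```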
I also guess that if another character were the token separator, and if we are to
believe that the peak in the vms is due to the same cause as in other
languages, then the frequency of that separator should be similar to
that of the space.
There are about 39147 spaces. The nearest character in frequency terms is <o>
with 25522 occurrences, then <e> with 20551.
If that were the case, then the correlation peak would not coincide
with the modal token length (not that this is necessary, of course),
and I also wonder whether the segmented text would follow Zipf's law.
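A quick rank-frequency check of the kind I mean (a sketch with constructed counts, not measured ones; Zipf's law predicts frequency roughly proportional to 1/rank):

```python
from collections import Counter

# Constructed toy token stream whose counts happen to follow 1/rank;
# with a real re-segmented text one would plot log(freq) vs log(rank)
# and look for a straight line of slope near -1.
tokens = ("daiin " * 40 + "chedy " * 20 + "shedy " * 13 +
          "qokeedy " * 10 + "okaiin " * 8).split()

ranked = [freq for _, freq in Counter(tokens).most_common()]
zipf = [round(ranked[0] / rank) for rank in range(1, len(ranked) + 1)]
print(ranked)  # observed frequencies by rank
print(zipf)    # Zipf prediction from the top frequency
```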
To see how likely <o> is to be a token separator, re-segment the text with <o>
playing the role of <.> (the space). This generates extremely long tokens and
also introduces the problem of the line-initial <o>. Quite a few labels start
with <o>, which would be unnecessary if <o> were a separator.
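To make the point concrete, a sketch on a toy EVA-like line (assumed data, not a real VMs line), comparing segmentation on <.> with segmentation on <o>:

```python
# <.> is the EVA space; try <o> as the separator instead.
line = "okaiin.chedy.qokeedy.otedy.daiin.olkeedy"

space_tokens = [t for t in line.split(".") if t]
o_stream = line.replace(".", "")        # drop the spaces
o_parts = o_stream.split("o")           # note: the line-initial <o> yields an
o_tokens = [t for t in o_parts if t]    # empty leading token, dropped here

print(max(map(len, space_tokens)))  # longest <.>-delimited token
print(max(map(len, o_tokens)))      # longest <o>-delimited token: much longer
```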
> But in the case of the VMs, if we don't appeal to other evidence like
> Stolfi's "morphosyntaxes" - real enough evidence of course - then we can't
> be sure that the observed spacing isn't conceivably inserted at the same
> modal lengths as the "real token" but in different places.
Of course we cannot dismiss that possibility. But we can ask ourselves whether
that is something that could have been planned so accurately; I really doubt it.
> Here's an experiment. Suppose one created a set of arbitrary rules for
> generating character sequences, e.g., something as simple as all words
> consisting of ba*c?, with some known limits on length, or something more
> complex, on the order of what Stolfi has deduced, and used it to
> generate an arbitrary sequence of tokens. I hypothesize that analyzed
> with the spectral technique such a string of tokens would produce similar
> results to the VMs and Chaucer, modulo particular modes. And I
> hypothesize that the results would also be independent of token order, but
> not of character order. This would be analogous to generating numbers
> from periodic functions adding some random noise and analyzing it. If we
> shuffled the random numbers, the periods would still be there.
Yes, it would, but I do not see why that makes the results any different.
As mentioned earlier this is due (I think) to the word construction rules and
their frequencies. Whether there is any meaning is a completely different
matter (I think there is large scale structure that shows up as long range
correlations; this is dealt with below).
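The quoted experiment is easy to run. A sketch, assuming the simple rule b a* c? (a 'b', zero or more 'a's, an optional 'c') and my reading that token order should not matter:

```python
import random
import re
from collections import Counter

random.seed(0)

def make_token():
    # One token matching b a* c?, with a known limit on length.
    return "b" + "a" * random.randint(0, 4) + ("c" if random.random() < 0.5 else "")

tokens = [make_token() for _ in range(2000)]
assert all(re.fullmatch(r"ba*c?", t) for t in tokens)

shuffled = tokens[:]
random.shuffle(shuffled)

# The token-length distribution -- which sets the 'modal length' peak --
# is unchanged by reordering the tokens, even though character order
# within tokens is what produced it in the first place.
assert Counter(map(len, tokens)) == Counter(map(len, shuffled))
print(Counter(map(len, tokens)).most_common(3))
```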
I posted some time ago a way (using a dictionary) to generate vms-like text
that is fully meaningful and has the same entropy as the vms, plus Zipf's
law. Again, spectral analysis of this text would reveal the same results
as the vms and I doubt that anybody would be able to crack it. (I made an
offer to send a text encoded this way and nobody has taken the challenge yet,
even with the advantage that I can assure that this one is meaningful.)
Funnily enough, according to Rugg this would be a meaningless text (it shares a
large number of statistical properties with the vms), but I know it isn't
gibberish, so his conclusion is incorrect.
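A toy illustration of the principle (invented codebook entries, and much simpler than the scheme described in that earlier post): a fixed word-to-token dictionary keeps the output fully meaningful because the mapping is invertible, while the token statistics are whatever the codebook gives them.

```python
# Toy codebook (invented entries). The point: an invertible mapping
# means the ciphertext carries meaning, whatever its surface statistics.
codebook = {"the": "daiin", "voynich": "qokeedy", "text": "chedy",
            "is": "ol", "meaningful": "shedy"}
inverse = {token: word for word, token in codebook.items()}

plaintext = "the voynich text is meaningful"
cipher = " ".join(codebook[w] for w in plaintext.split())

print(cipher)
assert " ".join(inverse[t] for t in cipher.split()) == plaintext  # round trip
```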
> If this is true, and I think it would prove to be, what this tells us - or
> rather, I think, reminds us - is that the spectral technique can recognize
> patterning in the form of tokens, but that it can't guarantee that this
> patterning has any functional load.
What you wrote above is correct in terms of the "modal-token" peak, but
there is more in the correlation plots. The long range correlations would
tell you if there is any large scale structure in the stream.
Figure 8a in my Cryptologia paper shows that these long range correlations
(the negative slope of the left part of Fig 8a) disappear with token
(and character) scrambling. So my conclusion on this was that there is some
structural part of the text which is destroyed with the token scrambling.
If the vms text was a random collection of tokens, these long range
correlations would not exist in the first place.
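A sketch of that argument (a toy construction, not the VMs, and a cruder statistic than the correlation exponent in the paper): give the stream large-scale 'subject' structure by alternating blocks that favour different vocabularies, then shuffle the tokens and watch the long-window fluctuations collapse.

```python
import random

random.seed(1)

# Alternating 'chapters' that prefer different (invented) vocabularies.
vocab_a = ["okaiin", "otedy", "oldaiin"]   # one <o> per token
vocab_b = ["chedy", "shedy", "qkeedy"]     # no <o> at all

tokens = []
for block in range(20):
    vocab = vocab_a if block % 2 == 0 else vocab_b
    tokens += [random.choice(vocab) for _ in range(200)]

shuffled = tokens[:]
random.shuffle(shuffled)

def window_variance(text, size=1000, ch="o"):
    # Variance of the <o>-frequency across long non-overlapping windows:
    # a crude proxy for large-scale structure.
    freqs = [text[i:i + size].count(ch) / size
             for i in range(0, len(text) - size + 1, size)]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) / len(freqs)

v_orig = window_variance("".join(tokens))
v_scrambled = window_variance("".join(shuffled))
print(v_orig, v_scrambled)  # the large-scale fluctuations vanish on scrambling
```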
You may ask "what on Earth do we mean by 'structure'?". I do not
think that there is a consensus, other than that there is some non-random
large scale (1000-10000 characters) ordering of the characters. Some suggestions
for what may be governing this 'structure' are sentences, paragraphs, chapters
and subject matter, but I am sure that there may be others, such as a human bias
against repeating the same word in consecutive sentences, or the clustering of
certain words in some parts of the corpus, and so on.
> Note that I think that Rugg hasn't managed that. He has pointed to the
> repetition of words, but most of his actual work has been at the level of
> finding a plausible and/or easy way to generate suitably patterned tokens.
Exactly! He is only doing reverse-engineering at the token level.
If he wants to generate large scale structure I am sure that there are some
convoluted ways to do it, but this is in effect moving the goal posts all the
time. The more requirements, the less believable the whole thing becomes.
Interestingly, he has not produced any long chunks of text, so there is no way
to test large scale structure either. But even if we tested it and there were
no large scale structure, I fear that his argument would go in the
same direction as his explanation of the labels and weirdoes: you get
another special table for that. (!)
To me it seems that the claims are not supported by any data (at least in his
paper there isn't any quantitative result), not even at the token level. It
is pretty much qualitative stuff.
> He's shown that he can generate
> tokens - word like objects - not that these aren't being used to encode
Sure, and as Jacques put it very nicely in his recent review, being able to
generate Chinese-like gibberish does not mean that Chinese literature is
gibberish.
> This refers to polyalphabetic substitutions that blur the frequencies of
> letter occurrence?
Sorry for the long post.