
Re: LSC sums for monkey texts



Hello Mark,

> Rene, I have looked at your curves and noticed the following features:
> the Se curves (calculated) are exactly as we obtained, so in regard to Se
> your program seems to work the same way. 2) If I understand it correctly
> (and if not you correct me) what you call 1st order monkey is actually a
> random permutation of letters of the original text.

It is not significantly different from that. It is a computer-generated
text whose single-character frequencies match those of the source text.
All characters are generated independently of each other (i.e. sampling
_with_ replacement, rather than a true permutation).
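
To make the 1st order case concrete, here is a minimal Python sketch of
the idea. It is not the actual Monkey program; the function name and
the seed parameter are made up for illustration.

```python
import random
from collections import Counter


def first_order_monkey(source, length, seed=None):
    """Draw `length` characters independently, with replacement, using
    the single-character frequencies of `source` as weights.
    (Hypothetical helper, not the original Monkey program.)"""
    if not source:
        raise ValueError("source text must not be empty")
    rng = random.Random(seed)
    counts = Counter(source)            # character -> number of occurrences
    chars = list(counts)
    weights = [counts[c] for c in chars]
    # Each draw is independent of all earlier draws.
    return "".join(rng.choices(chars, weights=weights, k=length))
```

Unlike a true permutation, the output only approximates the character
counts of the source, and it can have any length.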
 
> Indeed, the Sm LSC sum looks like those we obtained for such
> permutations

> 3) The higher order monkeys are (if I understood it correctly) results
> of random permutations of n-tuples of letters.

Again: almost. Taking the case of the 3rd order process, the Monkey
program first makes a table of all character triplets in the source
text. The computer-generated text is then built character by character,
where the probability of each new character depends on the two preceding
ones: it follows the distribution of all triplets in the source text
that start with the same pair of characters.
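
The 3rd order process described above could be sketched in Python
roughly as follows. Again this is not the actual Monkey program: the
names are invented, and both the choice of starting pair and the
handling of a pair with no recorded follower are my own assumptions.

```python
import random
from collections import Counter, defaultdict


def build_triplet_table(source):
    """Map each character pair to a Counter of the characters that
    follow that pair in `source`."""
    table = defaultdict(Counter)
    for i in range(len(source) - 2):
        table[source[i:i + 2]][source[i + 2]] += 1
    return table


def third_order_monkey(source, length, seed=None):
    """Generate text where each new character is drawn from the
    distribution of source triplets sharing the two preceding
    characters. (Hypothetical helper.)"""
    if len(source) < 3:
        raise ValueError("source text needs at least three characters")
    rng = random.Random(seed)
    table = build_triplet_table(source)
    out = list(source[:2])              # assumption: start with the first pair
    while len(out) < length:
        followers = table.get("".join(out[-2:]))
        if not followers:
            # Assumption: if the pair has no recorded follower (it occurs
            # only at the very end of the source), restart from a random
            # pair that does have one.
            out.extend(rng.choice(list(table)))
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out[:length])
```

With a short source most pairs have only one or two possible followers,
so the output largely copies stretches of the source; this effect gets
stronger as the order goes up.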

> The fourth order monkey is then
> somehow similar to our texts obtained by random permutations of words.

Because the source text is really *much* too short to use a 4th order
monkey properly, this text will indeed tend to consist of small chunks
from the source text all mixed up.
I must look again at your tests for texts with the words mixed up.
I would expect such a text to be 'nearer to meaningful' than a
4th order monkey text.

> Indeed, the Sm curves for 4-order monkey looks rather similar to our
> word-shuffled texts. 4) What is puzzling is the Sm curve for your
> original Latin text.  It is like our typical Sm curves for meaningful
> texts (including Genesis in Latin) at small n, but is rather different at
> large n. For all meaningful texts we obtained a well expressed growth of
> Sm at n exceeding that for well formed PMP.  In your example PMP seems to
> be not well formed and there is no typical rise of Sm toward large n. In
> order to find the reason for that, I'll email you tomorrow some texts
> we used (including VMS-A and VMS-B). If you conduct LSC test on them
> using your program we'll be able to see if you obtain the same curves we
> did or your program works differently.

Yes. 
I will email you the text I used, and also the table of Sm and Se
values resulting from it. I suspect that the text length plays a 
major role. The jitter for higher values of 'n' in several of the
graphs makes me think that the text may have been a bit short.
Today I just ran one case: an English text of about 700,000 characters,
and the Sm curve was very smooth and went up to over 4*Se for n=50,000.

> I would like to say that our program was tested and retested
> very meticulously and we are confident it measures OK.  So,
> either you encountered a Latin text which is peculiar
> in regard to LSC, or something is wrong with the program.

I do not doubt for a moment that your program is reliable,
which is why I would like to try mine on your source texts and
compare with the numbers in your articles.

More later,
        Rene