
Re: A few LSC comments

Rene, I have had a few thoughts in regard to the monkey texts. You have
slightly modified our formula for Se by assuming that the distribution of
text elements (letters, digrams, trigrams, n-grams) is with replacement.
Let me say something about this assumption. I believe we have to
distinguish between four situations, to wit:
1) Texts generated by permutations of the above elements (as was the
case in our study). In this case there is a limited stock of the above
elements, hence there is a negative correlation between the elements'
distributions in chunks, and therefore it is a case without replacement
(hypergeometric distribution).  Our formula for Se was derived for that case.
2) Monkey texts generated by using the probabilities of elements (letters,
digraphs, etc) and also assuming that the stock of those elements is the
same as that available for the original meaningful text.  In this case we
have again negative correlation and it is a no-replacement case
(hypergeometric) so our formula is to be used without a modification.
3) The text generated as in item 2) but assuming the stock of letters is
much, much larger (say 100,000 times larger) than that available in the
original text, while preserving the ratios of element occurrences as in
the original text.  This is a case with replacement (approximately but with
increasing accuracy as the size of the stock increases). In this case our
formula has to be modified (as indicated in paper 1) using multinomial
variance.  Quantitatively the difference is only in the coefficient
L/(L-1), which for L>>1 is negligibly different from 1.
4) The text generated assuming the stock of elements is infinitely large.
In this case the distribution of elements is uniform, i.e. the
probabilities of all elements become equal to each other (each equal to 1/z,
where z is the number of all possible elements (letters, or digrams, etc.)
in the original text). In this case the formula for Se simplifies (I derived it
in paper 1 for that case as an approximation to roughly estimate Se for
n>1).  Quantitatively cases 1 through 3 are very close, but case 4 produces
quantities measurably (but not very much) differing from cases 1 through 3
(see examples in paper 1).
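
To illustrate why cases 1 through 3 come out so close, here is a minimal
sketch comparing the textbook variance of the count of one element in a
chunk drawn without replacement (hypergeometric) and with replacement
(binomial, i.e. the per-element multinomial variance). It is not the Se
formula from paper 1. The names stock, matching and chunk, and the sample
numbers, are assumptions made up for the illustration.

```python
from fractions import Fraction


def count_variance(stock, matching, chunk, replacement):
    """Variance of the number of one given element found in a chunk.

    stock    -- total number of elements available (hypothetical name)
    matching -- how many of them are the element being counted
    chunk    -- number of elements drawn into the chunk
    """
    p = Fraction(matching, stock)
    binomial = chunk * p * (1 - p)          # with replacement
    if replacement:
        return binomial
    # Without replacement: binomial variance times the finite-stock correction.
    return binomial * Fraction(stock - chunk, stock - 1)


def variance_ratio(stock, matching, chunk):
    """No-replacement variance divided by with-replacement variance."""
    return (count_variance(stock, matching, chunk, False)
            / count_variance(stock, matching, chunk, True))


# Enlarge the stock while preserving the element's share, as in case 3.
for scale in (1, 100, 100_000):
    stock, matching = 10_000 * scale, 800 * scale
    print(scale, float(variance_ratio(stock, matching, 100)))
```

The ratio moves toward 1 as the stock is scaled up, which is the sense in
which case 3 is "approximately" a case with replacement.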
All of the above is of purely academic interest, but sometimes I am a
stickler, for the sake of some abstract accuracy.  Practically it is
inconsequential unless very short texts are used.

Cheers, Mark