
- *To*: Rene <Zandbergen@xxxxxxxxxxx>
- *Subject*: Re: A few LSC comments
- *From*: Mark Perakh <perakh@xxxxxxxxxxx>
- *Date*: Wed, 19 Jan 2000 14:15:58 -0800
- *Cc*: voynich@xxxxxxxx
- *Delivered-to*: reeds@research.att.com
- *Organization*: home
- *References*: <12AJxt-0OiTYmC@fwd03.sul.t-online.de>
- *Reply-to*: perakh@xxxxxxxxxxx
- *Sender*: jim@xxxxxxxxxxxxx

Rene,

I have had a few thoughts regarding the monkey texts. You have slightly modified our formula for Se by assuming that the text elements (letters, digrams, trigrams, n-grams) are distributed with replacement. Let me say something about this assumption. I believe we have to distinguish between four situations, to wit:

1) Texts generated by permutations of the above elements (as was the case in our study). In this case there is a limited stock of elements, hence there is a negative correlation between the elements' distributions in chunks, and therefore it is a case without replacement (hypergeometric distribution). Our formula for Se was derived for that situation.

2) Monkey texts generated by using the probabilities of elements (letters, digrams, etc.) and also assuming that the stock of those elements is the same as that available for the original meaningful text. In this case we again have negative correlation and it is a no-replacement case (hypergeometric), so our formula is to be used without modification.

3) The text generated as in item 2) but assuming the stock of letters is much larger (say 100,000 times larger) than that available in the original text, while preserving the ratios of the elements' occurrences as in the original text. This is a case with replacement (approximately, but with increasing accuracy as the size of the stock increases). In this case our formula has to be modified (as indicated in paper 1) using the multinomial variance. Quantitatively the difference is only in the coefficient L/(L-1), which at L>>1 is negligible.

4) The text generated assuming the stock of elements is infinitely large. In this case the distribution of elements is uniform, i.e. the probabilities of all elements become equal to each other (each equal to 1/z, where z is the number of all possible elements (letters, digrams, etc.) in the original text). In this case the formula for Se simplifies (I derived it in paper 1 for that case as an approximation to roughly estimate Se for n>1).

Quantitatively cases 1 through 3 are very close, but case 4 produces quantities measurably (but not very much) different from cases 1 through 3 (see examples in paper 1). All of the above is of purely academic interest, but sometimes I am a stickler for the sake of some abstract accuracy. Practically it is inconsequential unless very short texts are used.

Cheers, Mark
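
To make the L/(L-1) remark in item 3 concrete, here is a minimal Python sketch. It does not reproduce the Se formula from paper 1, which is not quoted in this message. It assumes a stand-in quantity: the expected sum, over all chunks, of the squared deviation of one element's chunk count from its mean count per chunk. The function names and the symbols M (occurrences of the element in the text) and m (chunk size) are hypothetical; L is the text length and is assumed to be a multiple of m.

```python
from fractions import Fraction


def sum_sq_dev_without_replacement(M, L, m):
    """Permuted text (cases 1 and 2): chunk counts are hypergeometric.

    Assumption: the mean count per chunk is fixed at m*M/L, so the expected
    sum is (number of chunks) * (hypergeometric variance of one chunk count).
    """
    p = Fraction(M, L)
    chunks = L // m
    variance = m * p * (1 - p) * Fraction(L - m, L - 1)
    return chunks * variance


def sum_sq_dev_with_replacement(M, L, m):
    """Very large stock (case 3): chunk counts are independent binomials.

    Assumption: deviations are measured from the realized mean count per
    chunk, which costs one degree of freedom, hence the (chunks - 1) factor.
    """
    p = Fraction(M, L)
    chunks = L // m
    variance = m * p * (1 - p)
    return (chunks - 1) * variance


def ratio(M, L, m):
    """Ratio of the no-replacement value to the with-replacement value."""
    return sum_sq_dev_without_replacement(M, L, m) / sum_sq_dev_with_replacement(M, L, m)


if __name__ == "__main__":
    # Hypothetical numbers: an element seen 120 times in a 1000-element text.
    for chunk_size in (10, 50, 100):
        print(chunk_size, ratio(120, 1000, chunk_size))
```

Under these assumptions the ratio comes out as L/(L-1) for every chunk size and every element frequency, which is why the correction only matters for very short texts.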

**Follow-Ups**: **Reinterpreting the LSC (long)**, *From:* Jorge Stolfi

**References**: **A few LSC comments**, *From:* Rene
