
*To*: voynich@xxxxxxxx
*Subject*: RE: On the word length distribution
*From*: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
*Date*: Tue, 26 Dec 2000 13:22:17 -0200 (EDT)
*Delivered-to*: reeds@research.att.com
*Reply-to*: stolfi@xxxxxxxxxxxxx
*Sender*: jim@xxxxxxxxxxxxx

Hi again...

In my last message I described two methods for generating random text with the required binomial word-length distribution.

With the first method (nine coins), the *token* length distribution would be binomial, too, which is not what we see.

The second method (sequential generation of letters, with random resets) would produce a binomial word distribution, and a token distribution biased towards shorter words. Unfortunately, the bias would be excessive: tokens of length k would be generated with relative probability p**k, for some p < 1. This is an exponential decay, which is not what we see. In fact, the VMS token length distribution is still hump-shaped, with a maximum near 4.5 symbols (just one less than the maximum of the word distribution).

So we still lack a plausible mechanism that would generate random text with the observed word and token length distributions.

On the other hand, it looks like the nomenclator scheme described in the webpage will produce the required distributions. Namely, assign a number to each new word that comes up in the plaintext (or in some other "practice" text), in sequential order; and then encode the numbers in the `bit position' notation.

Note also that the `bit position' notation is not that exotic. The Roman, Greek, and Chinese number systems were essentially like that, except that they were base-10 instead of base-2. The old English capacity system was both `bit position' and binary:

  pint
  quart
  quart + pint
  pottle
  pottle + pint
  pottle + quart
  pottle + quart + pint
  gallon
  gallon + pint
  etc.

Now for something weird: I had learned about the English binary capacity system, many years ago, from Knuth's Art of Computer Programming (vol. 2, sec. 4.1); but I had to look it up again since I had forgotten the word for 1/2-gallon. While searching for "capacity" in the index, I ran into a reference to "Caramuel y Lobkowitz, Juan".
Now, this guy was a Spanish bishop, apparently sitting at Naples (which then was a Spanish possession), who corresponded extensively with Marci and Kircher about exotic languages and cryptography, and even wrote Marci's eulogy. Small world, this one....

The weird part is that he is referenced by Knuth in the same section as the English capacity table, in fact right in the next paragraph. According to Knuth, the first published description of the binary number system (and of number systems in other bases) was a little-known work by this fellow. Is the Millennium coming, or what? 8-)

In fact, now that the thing came up: among the many letters by Caramuel that I saw in the Carteggio Kircheriano site, I recall one which did seem to have a list of binary numbers. I will try to find the URL...

For the record, Knuth's reference for Caramuel's binary number paper is "Mathesis biceps 1" (Campaniae, 1670), 45-48.

All the best,

--stolfi
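
For readers who want to experiment, here is a minimal Python sketch of the `bit position' notation mentioned in the message. It is an illustration only, not code from the message or the webpage. The function name `bit_position`, the list `UNITS`, and the constant `BITS` are made up for this example. The second part assumes that a code word gets one symbol per position used, with nine positions as in the "nine coins" method; under that assumption the number of codes of each length follows the binomial coefficients.

```python
from collections import Counter
from math import comb

# Unit names from the capacity table in the message; position i is worth
# 2**i pints (pint = 1, quart = 2, pottle = 4, gallon = 8).
UNITS = ["pint", "quart", "pottle", "gallon"]


def bit_position(n, names=UNITS):
    """Write n as a sum of named powers of two, largest unit first."""
    if not 0 < n < 2 ** len(names):
        raise ValueError("n is out of range for the given unit names")
    parts = [names[i] for i in reversed(range(len(names))) if (n >> i) & 1]
    return " + ".join(parts)


# Counting upwards reproduces the sequence pint, quart, quart + pint, ...
for n in range(1, 10):
    print(n, bit_position(n))

# ASSUMPTION: one symbol per position used, and BITS = 9 positions.
# Then the "word length" of code n is the number of 1 bits in n.
BITS = 9
lengths = Counter(bin(n).count("1") for n in range(2 ** BITS))

# The count of codes with exactly k positions used is C(BITS, k).
assert all(lengths[k] == comb(BITS, k) for k in range(BITS + 1))
```

Assigning the numbers in order of first appearance, as the message proposes, would tend to give common words the small numbers. This sketch does not model that step or the resulting token distribution.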
