
RE: On the word length distribution




Hi again...

In my last message I described two methods for generating 
random text with the required binomial word-length distribution. 

With the first method (nine coins), the *token* length distribution
would be binomial, too --- which is not what we see.  

The second method (sequential generation of letters, with random
resets) would produce a binomial word distribution, and a token
distribution biased towards shorter words. Unfortunately, the bias
would be excessive: tokens of length k would be generated with relative
probability p**k, for some p < 1. This is an exponential decay, which is
not what we see. In fact, the VMS token length distribution is still
hump-shaped, with a maximum near 4.5 symbols (just one less than the
word distribution maximum).
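
To see why p**k cannot give a hump, here is a tiny Python sketch. The
value p = 0.7 and the function name are arbitrary choices of mine, used
only for illustration; any p < 1 behaves the same way.

```python
def geometric_weights(p, max_len):
    """Relative probability p**k of a token of length k, for k = 1..max_len."""
    return [p**k for k in range(1, max_len + 1)]

# p = 0.7 is an arbitrary illustrative value, not an estimate from the VMS.
weights = geometric_weights(0.7, 9)

# Each weight is p times the previous one, so the sequence can only fall.
is_decreasing = all(a > b for a, b in zip(weights, weights[1:]))
print(is_decreasing)
```

The largest weight is always at k = 1, so there is no interior maximum
of the kind the token statistics show.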

So we still lack a plausible mechanism that would generate 
random text with the observed word and token length distributions.

On the other hand, it looks like 
the nomenclator scheme described in the webpage
will produce the required distributions.  Namely,
assign a number to each new word that comes up in the plaintext
(or in some other "practice" text), in sequential 
order; and then encode the numbers in the `bit position'
notation.
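
A rough Python sketch of how I read this scheme: I am assuming that
the `bit position' code of a number has one symbol per 1-bit of its
binary form, so the code length is just the number of 1-bits. The
nine-bit width (borrowed from the nine coins) and all function names
are my own assumptions for illustration.

```python
from collections import Counter
from math import comb

NBITS = 9  # assumed code width, matching the nine coins of the first method

def assign_numbers(words):
    """Number each new word sequentially, in order of first appearance."""
    table = {}
    for w in words:
        if w not in table:
            table[w] = len(table) + 1
    return table

def code_length(n):
    """Symbols in the bit-position code of n: one per 1-bit (assumed)."""
    return bin(n).count("1")

def word_length_counts(nbits=NBITS):
    """Count distinct codes of each length among the numbers 1 .. 2**nbits - 1."""
    return Counter(code_length(n) for n in range(1, 2**nbits))

counts = word_length_counts()
for k in range(1, NBITS + 1):
    # Second and third columns agree: the word lengths are binomial.
    print(k, counts[k], comb(NBITS, k))
```

Under these assumptions the number of distinct codes of length k is
exactly the binomial coefficient C(9,k). Words that come up early get
small numbers, which have fewer 1-bits on average, so the tokens should
lean towards shorter codes; how closely that matches the observed token
hump would still need to be checked against real text.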

Note also that the `bit position' notation is not that exotic. The
Roman, Greek, and Chinese number systems were essentially like that,
except that they were base-10 instead of base-2. The old English
capacity system was both `bit position' and binary:

  pint
  quart
  quart + pint
  pottle
  pottle + pint
  pottle + quart
  pottle + quart + pint
  gallon
  gallon + pint
 
etc.
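
The list above can be produced mechanically from the binary form of
the number of pints. A small Python sketch, assuming only the four unit
names that appear in the list (pint = 1, quart = 2, pottle = 4,
gallon = 8); the function name is mine:

```python
# Unit sizes in pints, largest first; only the units named in the list above.
UNITS = [("gallon", 8), ("pottle", 4), ("quart", 2), ("pint", 1)]

def capacity_name(pints):
    """Write a volume of 1..15 pints as a sum of distinct units."""
    if not 1 <= pints <= 15:
        raise ValueError("only 1..15 pints are covered by these four units")
    # A unit is named exactly when its bit is set in the binary form.
    parts = [name for name, size in UNITS if pints & size]
    return " + ".join(parts)

for n in range(1, 10):
    print(capacity_name(n))
```

Run for 1 to 9 pints, this prints the same nine lines as the list.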

Now for something weird: I had learned about the English binary
capacity system, many years ago, from Knuth's Art of Computer
Programming (vol.2, sec. 4.1); but I had to look it up again since I
had forgotten the word for 1/2-gallon. While searching for "capacity"
in the index, I ran into a reference to "Caramuel y Lobkowitz, Juan".

Now, this guy was a Spanish bishop, apparently sitting at Naples
(which then was a Spanish possession), who corresponded extensively
with Marci and Kircher about exotic languages and cryptography, and
even wrote Marci's eulogy. Small world, this one....

The weird part is that he is referenced by Knuth in the same section
as the English capacity table, in fact right in the next paragraph.
According to Knuth, the first published description of the binary
number system (and number systems in other bases) was a little-known
work by this fellow.

Is the Millennium coming, or what?  8-)

In fact, now that the thing came up: among the many letters by
Caramuel that I saw in the Carteggio Kircheriano site, I recall one
which did seem to have a list of binary numbers. I will try to find
the URL...

For the record, Knuth's reference for Caramuel's binary number paper
is "Mathesis biceps 1" (Campaniae, 1670), 45-48.

All the best,

--stolfi