> I've thought for a while now that a good explanation for this might be a
> combination of (a) an abbreviating private shorthand (which I suspect would
> approach a kind of binomial distribution as the sample-length goes up, but
> probably with a shorter word-length "peak"), and (b) a verbose cipher (to
> move the peak sideways, ie to the right again).
>
> Unfortunately, I know of no sample statistics for (a)-like shorthand texts,

There must be examples of modern shorthand texts, although I don't know of any in convenient form. I suppose there are Unicode fonts for Gregg, etc.

Modern shorthand is quite different from the kind I have in mind here, which would be a loose mixture of late-medieval Tironian-like ad hoc abbreviations and Radcliff's drop-letters-you-don't-need system (as mentioned on-list), so the stats would be quite different. :-(
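
In the absence of real sample statistics, the two-stage idea can at least be played with on a toy model. The following is only a rough sketch under loud assumptions: the `abbreviate` rule (keep the first letter, drop later vowels) is a crude stand-in for a drop-letters-you-don't-need shorthand, the `VERBOSE` table is entirely made up, and the sample sentence is arbitrary. None of it is a claim about how the VMs was actually produced; it only shows how one might measure the word-length peak moving left and then right again.

```python
from collections import Counter

VOWELS = set("aeiou")

def abbreviate(word):
    """Toy shorthand (assumption): keep the first letter, drop later vowels."""
    return word[0] + "".join(c for c in word[1:] if c not in VOWELS)

# Hypothetical verbose-cipher table: a few plaintext letters become two glyphs.
VERBOSE = {"t": "qo", "n": "dy", "r": "ch", "s": "ol"}

def verbose_encipher(word):
    """Replace each letter by its table entry, or leave it unchanged."""
    return "".join(VERBOSE.get(c, c) for c in word)

def length_histogram(words):
    """Map word length -> number of words of that length, sorted by length."""
    return dict(sorted(Counter(len(w) for w in words).items()))

def mean_length(words):
    return sum(len(w) for w in words) / len(words)

# Arbitrary lowercase sample text, for illustration only.
sample = "the quick brown fox jumps over the lazy dog and runs into the garden".split()
short = [abbreviate(w) for w in sample]        # stage (a): abbreviation
cipher = [verbose_encipher(w) for w in short]  # stage (b): verbose cipher

for label, words in (("plain", sample), ("short", short), ("cipher", cipher)):
    print(label, round(mean_length(words), 2), length_histogram(words))
```

Run on a real corpus rather than one sentence, the three histograms could be compared directly with the VMs word-length curve, though the result would depend heavily on the abbreviation rule and cipher table chosen.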

Also, what about Japanese texts with differing degrees of kana-richness?

While modern Japanese does a fair amount of aggressive abbreviating (like "seku-hara" for "sexual harassment"), the size of its alphabets (even any one of them!) puts it in quite a different bracket from the size of the alphabet we see in the VMs (even with a verbose cipher!).

