
RE: On the word length distribution



[ DJL, this is a corrected version of the reply I sent you minutes
  ago. --stolfi ]

Well, on second thoughts, the binomial distribution of word lengths is
a bit less remarkable than what I thought. It will be observed in any
code or spelling system that has the following properties:

  (1) each word has nine distinguished slots;
  
  (2) each slot can be either empty, or filled with the one
      symbol specific to that slot;
      
  (3) all possible choices in (2) result in distinct words;
  
  (4) all possible choices in (2) do occur in the text.
  
Note that we need no assumptions on probabilities,
only on possibilities.
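
To make the counting explicit, here is a minimal Python sketch that
enumerates every word allowed by rules (1)-(4) and tallies the
distinct words by length. The slot symbols a..i and the names SLOTS
and all_words are my own placeholders, not anything taken from the
VMS.

```python
from collections import Counter
from itertools import product
from math import comb

SLOTS = "abcdefghi"  # hypothetical symbols, one per slot (rule 1)

def all_words(slots=SLOTS):
    """Return every word obtained by leaving each slot empty or filled (rule 2)."""
    return ["".join(s for s, on in zip(slots, mask) if on)
            for mask in product((0, 1), repeat=len(slots))]

words = all_words()
# Rule (3): all choices give distinct words, so there are 2^9 of them.
length_counts = Counter(len(w) for w in words)

for k in range(len(SLOTS) + 1):
    # Number of distinct words of length k, next to the binomial coefficient C(9,k).
    print(k, length_counts[k], comb(len(SLOTS), k))
```

The two columns agree for every k: the number of distinct words of
length k is just the number of ways to pick k of the nine slots.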

It is not hard to invent nomenclator codes, like the 
example I posted, that obey these rules.  An invented
language with a `logical' vocabulary could also fit.

However, these rules could perhaps also be satisfied by a natural
language with monosyllabic words. The Chinese syllable, for instance,
has six slots --- main consonant, up to three vowels, final consonant,
tone --- and, with a suitable spelling system, any of them can be
left empty.

Granted, in most natural languages of that kind, each slot can be
filled with one of several distinct symbols, and this variation would
break the binomial distribution. However, if the inventor of the
writing system was a mathematician, he could have chosen to denote
each multi-choice symbol by a combination of several single-choice
slots, just for the sake of symmetry.

Rule (3) too would require some tweaking of the spelling system.
Rule (4) is also problematic, but it seems that monosyllabic languages
actually come pretty close to fulfilling it (i.e., almost every
syllable has at least one common meaning).

So perhaps the Chinese theory is not legally dead yet...

    > [Don Latham:] How about a coin-flipping hoax?  
    > or a random generator of some kind to build a hoax?
    
Perhaps, but someone would have to propose a plausible method that
will generate the observed distributions.

To warm things up, here is an obvious idea, which doesn't quite work.
Get nine fair coins, each with one side blank and the other side
inscribed with its own symbol. To generate the next token, throw
the coins, and write down the symbols that come up on top, in a fixed
order, followed by a fixed marker symbol.

This method will generate all 2^9 = 512 subsets of the symbols, and
will indeed reproduce the binomial *word* length distribution (counting
each distinct word once) as seen in the VMS. However, each of those
combinations will be generated with equal probability, and therefore
the *token* length distribution (counting every occurrence) would be
binomial, too --- which is not what we see in the VMS.
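
The nine-coin method is easy to simulate. The sketch below is only
an illustration under my own assumptions: the symbols a..i, the
function names, the seed and the sample size are all placeholders.
It tallies lengths twice, once over all tokens and once over the
distinct words.

```python
import random
from collections import Counter
from math import comb

SYMBOLS = "abcdefghi"  # hypothetical symbols, one per coin

def coin_token(rng, symbols=SYMBOLS):
    """Throw all nine coins; keep, in fixed order, the symbols that land face up."""
    return "".join(s for s in symbols if rng.random() < 0.5)

def coin_text(n_tokens, seed=0):
    """Generate n_tokens tokens with a seeded generator, for repeatability."""
    rng = random.Random(seed)
    return [coin_token(rng) for _ in range(n_tokens)]

tokens = coin_text(100_000)
token_lengths = Counter(len(t) for t in tokens)      # every occurrence counts
word_lengths = Counter(len(w) for w in set(tokens))  # each distinct word counts once

for k in range(len(SYMBOLS) + 1):
    # Observed token share, binomial prediction, and number of distinct words.
    print(k, round(token_lengths[k] / len(tokens), 4),
          round(comb(9, k) / 512, 4), word_lengths[k])
```

Since all 512 combinations are equally likely, the token shares
track C(9,k)/512 just as the word counts track C(9,k).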

Here is another possible method. Get yourself an ordinary six-sided
die and a paper disk divided into nine sectors, each inscribed with a
different symbol. Begin by placing a black pearl on the first sector. To
generate each letter, throw the die; if you get an even number, copy
the symbol under the pearl. Then, if the outcome was `6', take the pearl back
to sector one, else move it clockwise to the next sector. Then, if the
pearl happens to be on sector one, start a new word. Repeat for the next
letter. (Of course, a black pearl is not really necessary --- a golden
scarab would work just as well.)

This method is almost equivalent to the previous one, except that 
the occasional moves back to sector one will truncate some of the 
words, and result in a *token* length distribution biased towards
short words.  On the other hand, if the text is long enough, all 
2^9 subsets of the symbols will be generated, and the distribution
of *word* lengths will be nicely binomial.
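
Here is a rough simulation of the die-and-pearl procedure, as I read
the steps above. The sector labels a..i, the function name
pearl_text, the seed and the sample size are my own choices, and a
token with no letters at all is recorded as an empty string.

```python
import random
from collections import Counter
from math import comb

SYMBOLS = "abcdefghi"  # hypothetical labels for the nine sectors

def pearl_text(n_tokens, seed=0, symbols=SYMBOLS):
    """Simulate the die-and-pearl method until n_tokens tokens are produced."""
    rng = random.Random(seed)
    tokens, current, pos = [], [], 0   # pos 0 is "sector one"
    while len(tokens) < n_tokens:
        d = rng.randint(1, 6)          # throw the die
        if d % 2 == 0:                 # even: copy the symbol under the pearl
            current.append(symbols[pos])
        # a 6 sends the pearl back to sector one, else it moves clockwise
        pos = 0 if d == 6 else (pos + 1) % len(symbols)
        if pos == 0:                   # pearl on sector one: start a new word
            tokens.append("".join(current))
            current = []
    return tokens

tokens = pearl_text(300_000)
token_lengths = Counter(len(t) for t in tokens)      # every occurrence counts
word_lengths = Counter(len(w) for w in set(tokens))  # each distinct word counts once
mean_token_length = sum(len(t) for t in tokens) / len(tokens)

print("distinct words:", len(set(tokens)))
print("mean token length:", round(mean_token_length, 2))
for k in range(len(SYMBOLS) + 1):
    print(k, token_lengths[k], word_lengths[k], comb(9, k))
```

With a sample this large every one of the 512 subsets should turn
up, so the distinct-word counts match C(9,k), while the mean token
length falls well below the 4.5 letters of the nine-coin method.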

This idea requires more analysis, but I really have to go home now.
All the best, and see you next week.

--stolfi