Also: Stolfi found ultra-regularity in the token length distribution. This
may be explained by a probability rule for insertion of NullSpaces.
I too thought about this.
If I were going to fake this general distribution (but instead peaking at,
say, 10), I'd take a pack of modern cards, throw out all the court cards,
and, every time I turned over an ace, insert a space. Once in a while, I'd
have to shuffle the deck, but basically that would be it.
But with average length 6, the easiest way would be to roll a normal
6-sided die: if it's a six, insert a space. How far off is that from the
observed distribution?
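
As a rough sketch of what the die model would predict (my own illustration,
not anything taken from Stolfi's data): assume one roll after every glyph,
with a six ending the token, so token length is geometric with p = 1/6. The
card version is approximated the same way with p = 4/40, ignoring that a
real deck is dealt without replacement. The names token_length_pmf and
mean_token_length are made up for this example.

```python
from fractions import Fraction


def token_length_pmf(p_space, max_len):
    """P(token length = k) for k = 1..max_len.

    Assumption: after each glyph a space is inserted independently with
    probability p_space, so length follows a geometric distribution.
    """
    return {k: (1 - p_space) ** (k - 1) * p_space
            for k in range(1, max_len + 1)}


def mean_token_length(p_space):
    """Mean of the geometric distribution above, which is 1 / p_space."""
    return 1 / p_space


# Die model: a six (1 face in 6) inserts a space.
die = token_length_pmf(Fraction(1, 6), 12)
# Card model: 4 aces among the 40 non-court cards (approximation).
cards = token_length_pmf(Fraction(4, 40), 12)

print("len   die     cards")
for k in die:
    print(f"{k:3d}  {float(die[k]):.4f}  {float(cards[k]):.4f}")

print("mean (die):  ", mean_token_length(Fraction(1, 6)))
print("mean (cards):", mean_token_length(Fraction(4, 40)))
```

Both models give the intended means (6 and 10), but in each the most likely
token length is 1 and the probabilities fall off steadily from there. That
is the shape to hold up against the observed distribution.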
Perhaps one key difference between Language A and Language B is that the
scribe doing Language B had a loaded die. :-/