
Gallows bit sequences: how embarrassing...




Ahem, hum, how should I put it ...

As you unfortunately must remember, a couple of months ago I claimed
that there was a surprising correlation between the "gallows bits" of
consecutive VMs words (defined as 1 if the word has a gallows letter,
0 if it doesn't). Put another way, the sequence of gallows bits was
not random, but showed many long runs of 0's and 1's:

    ?1110110
    00000110
    1000?1110
    1110100001
    01010000011
    111011111
    111010001
    00??1101
    11011??0?
    0100110110
    1101011001
    110000000001
    11111110

I thought the lumping of 0's and 1's was so obvious that there was no
need to compute the statistics. I proceeded to draw all sorts of
conclusions from that phenomenon; I even stopped believing in the
Chinese theory, conjectured that the language was Turkish, and spent a
fair amount of time (and e-mail) exploring this new path.

Well, I finally did my homework: I tabulated the number of runs of
each length, and ... how embarrassing!

  count obs.fr. exp.fr. run            count obs.fr. exp.fr. run
  ----- ------- ------- -------------  ----- ------- ------- -------------
   3660 0.28344 0.28135 0               3434 0.26593 0.27477 1          
   1559 0.12073 0.12403 00              1443 0.11175 0.12630 11         
    726 0.05622 0.05382 000              754 0.05839 0.05716 111        
    346 0.02679 0.02285 0000             355 0.02749 0.02532 1111       
    129 0.00999 0.00948 00000            180 0.01394 0.01095 11111      
     60 0.00465 0.00383 000000            94 0.00728 0.00462 111111     
     32 0.00248 0.00151 0000000           46 0.00356 0.00191 1111111    
     18 0.00139 0.00058 00000000          34 0.00263 0.00076 11111111   
     10 0.00077 0.00021 000000000         21 0.00163 0.00029 111111111  
      0 0.00000 0.00007 0000000000         7 0.00054 0.00010 1111111111 
      2 0.00015 0.00002 00000000000        1 0.00008 0.00003 11111111111
      1 0.00008 0.00001 000000000000                 
      1 0.00008 0.00000 0000000000000                 

These counts are derived from the whole text (minus labels), majority
version. All lines with unreadable or contentious characters were
discarded, leaving 3042 usable lines.

Each "1" means a word with one or more gallows, "0" a word with none.
Each entry above gives the count of maximal runs of 0's or 1's.
Runs were not allowed to extend across line breaks. The sample
contains 11850 "0"s (prob = 0.490) and 12330 "1"s (prob = 0.510).
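
For anyone who wants to repeat the tabulation, here is a minimal sketch of
the run counting described above. The names count_runs and
observed_frequencies are made up for illustration, and the input is assumed
to be one string of gallows bits per line, with "?" marking a doubtful
word. Frequencies are taken relative to the total number of runs, which is
consistent with the table (3660 / 12913 = 0.28344).

```python
from collections import Counter
from itertools import groupby


def count_runs(lines):
    """Count maximal runs of '0' and '1'; runs never cross line breaks.

    Assumption: each line is a string of gallows bits, and any line
    containing a character other than '0' or '1' is discarded whole.
    """
    counts = Counter()
    for line in lines:
        bits = line.strip()
        if not bits or set(bits) - {"0", "1"}:
            continue  # skip empty lines and lines with doubtful words
        for bit, group in groupby(bits):
            # groupby yields one group per maximal run of equal bits
            counts[(bit, len(list(group)))] += 1
    return counts


def observed_frequencies(counts):
    """Relative frequency of each (bit, length) run among all runs."""
    total = sum(counts.values())
    return {run: n / total for run, n in counts.items()}


# A few of the bit strings quoted earlier in this message.
sample = ["?1110110", "00000110", "1110100001", "11111110"]
runs = count_runs(sample)
freqs = observed_frequencies(runs)
```

The first sample line is dropped because of its "?", so only the other
three contribute runs.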

The column "obs.fr." is the relative observed frequency of each run,
and "exp.fr." is its expected frequency, computed for a random string
of "0"s and "1"s with the same 0-1 bit probabilities, and same
distribution of line lengths (mean 7.94 words, mode 10 words).
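
One way to obtain such expected frequencies is by simulation; the sketch
below is only that, and not necessarily how the "exp.fr." column above was
computed. It assumes independent bits with a fixed probability of "1", and
reuses the actual line lengths so that the line-break cutoff affects the
random text in the same way. The function name and its parameters are
hypothetical.

```python
import random
from collections import Counter
from itertools import groupby


def expected_run_frequencies(line_lengths, p_one, trials=200, seed=0):
    """Monte Carlo estimate of run frequencies for random bit lines.

    line_lengths: number of words in each line of the real text.
    p_one: probability that a word carries a gallows (bit '1').
    Assumption: bits are independent; runs stop at line breaks.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        for n in line_lengths:
            bits = ["1" if rng.random() < p_one else "0" for _ in range(n)]
            for bit, group in groupby(bits):
                counts[(bit, sum(1 for _ in group))] += 1
    total = sum(counts.values())
    return {run: c / total for run, c in counts.items()}


# Toy input: three line lengths and the overall probability of "1"
# quoted above (0.510). The real computation would use all 3042 lengths.
est = expected_run_frequencies([8, 8, 10], p_one=0.510, trials=500)
```

More trials give smoother estimates for the rare long runs, which are the
ones where the comparison with the observed counts matters most.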


As you can see, the observed distribution of run lengths is quite
close to that of random text. I.e., contrary to my claims, there is NO
significant correlation between the gallows bits of consecutive words.

The run-length statistics for individual sections show the same story.
I tried excluding short lines, and excluding the first and last run of
each line; still no correlation.

Sigh. It seems that I was stupidly fooled by a banal optical illusion.
Looking at the bit strings, the long runs of 0's and 1's are more
conspicuous than the short runs, and thus seem to be anomalously
common; but the cold statistics show that it ain't so.

So, my apologies to everyone for the false claim, and for the
(probably) irrelevant postings about Turkish linguistics. I will try
to be more careful in the future...

(There is one bright side to it, though --- the Chinese theory 
is not dead after all!)

All the best,

--stolfi