Gallows bit sequences: how embarrassing...
- To: voynich@xxxxxxxx
- Subject: Gallows bit sequences: how embarrassing...
- From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
- Date: Thu, 21 Sep 2000 06:12:29 -0300 (EST)
- Delivered-to: reeds@research.att.com
- In-reply-to: <39C8E22D.5295.18D2ED3@localhost>
- References: <39C88C9B.18295.3EDE50@localhost> <200009201448.e8KEmAt10314@coruja.dcc.unicamp.br> <39C8E22D.5295.18D2ED3@localhost>
- Reply-to: stolfi@xxxxxxxxxxxxx
- Sender: jim@xxxxxxxxxxxxx
Ahem, hum, how should I put it ...
As you unfortunately must remember, a couple of months ago I claimed
that there was a surprising correlation between the "gallows bits" of
consecutive VMs words (defined as 1 if the word has a gallows letter,
0 if it doesn't). Put another way, the sequence of gallows bits was
not random, but showed many long runs of 0's and 1's:
?1110110
00000110
1000?1110
1110100001
01010000011
111011111
111010001
00??1101
11011??0?
0100110110
1101011001
110000000001
11111110
I thought the lumping of 0's and 1's was so obvious that there was no
need to compute the statistics. I proceeded to draw all sorts of
conclusions from that phenomenon; I even stopped believing in the
Chinese theory, conjectured that the language was Turkish, and spent a
fair amount of time (and e-mail) exploring this new path.
Well, I finally did my homework: I tabulated the number of runs of
each length, and ... how embarrassing!
count  obs.fr.  exp.fr.  run              count  obs.fr.  exp.fr.  run
-----  -------  -------  -------------    -----  -------  -------  -------------
 3660  0.28344  0.28135  0                 3434  0.26593  0.27477  1
 1559  0.12073  0.12403  00                1443  0.11175  0.12630  11
  726  0.05622  0.05382  000                754  0.05839  0.05716  111
  346  0.02679  0.02285  0000               355  0.02749  0.02532  1111
  129  0.00999  0.00948  00000              180  0.01394  0.01095  11111
   60  0.00465  0.00383  000000              94  0.00728  0.00462  111111
   32  0.00248  0.00151  0000000             46  0.00356  0.00191  1111111
   18  0.00139  0.00058  00000000            34  0.00263  0.00076  11111111
   10  0.00077  0.00021  000000000           21  0.00163  0.00029  111111111
    0  0.00000  0.00007  0000000000           7  0.00054  0.00010  1111111111
    2  0.00015  0.00002  00000000000          1  0.00008  0.00003  11111111111
    1  0.00008  0.00001  000000000000
    1  0.00008  0.00000  0000000000000
These counts are derived from the whole text (minus labels), majority
version. All lines with unreadable or contentious characters were
discarded, leaving 3042 usable lines.
Each "1" means a word with one or more gallows, "0" a word with none.
Each entry above gives the count of maximal runs of 0's or 1's.
Runs were not allowed to extend across line breaks. The sample
contains 11850 "0"s (prob = 0.490) and 12330 "1"s (prob = 0.510).
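For anyone who wants to repeat the tabulation, here is a minimal sketch in Python of the run counting described above. It is an illustration, not the script that produced the table. It assumes each text line has already been reduced to a string of "0"/"1" characters, with "?" marking an unreadable word; the function name count_runs and the four sample lines (taken from the bit strings quoted earlier) are chosen only for the example.

```python
from collections import Counter
from itertools import groupby

def count_runs(lines):
    """Count maximal runs of identical bits; runs never cross a line break."""
    counts = Counter()
    for line in lines:
        if set(line) - {"0", "1"}:
            continue  # discard any line containing an unreadable word ('?')
        for bit, group in groupby(line):
            # groupby yields one group per maximal run of equal characters
            counts[bit * len(list(group))] += 1
    return counts

# Hypothetical input: a few of the bit strings quoted earlier in the message.
sample = ["?1110110", "00000110", "1110100001", "11111110"]
runs = count_runs(sample)
total = sum(runs.values())

# Print count, relative frequency, and the run itself, 0-runs first.
for run in sorted(runs, key=lambda r: (r[0], len(r))):
    print(f"{runs[run]:5d} {runs[run] / total:.5f} {run}")
```

Each line is handled on its own, so a run that reaches the end of a line is closed there, and the first sample line is dropped because of its "?".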
The column "obs.fr." is the relative observed frequency of each run,
and "exp.fr." is its expected frequency, computed for a random string
of "0"s and "1"s with the same 0-1 bit probabilities, and same
distribution of line lengths (mean 7.94 words, mode 10 words).
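One way to obtain such expected frequencies, sketched below, is to compute them exactly instead of by simulation. This is an assumption about the method; the message does not say how "exp.fr." was actually computed. For a line of n independent bits, where a given bit value has probability p (and q = 1 - p), the expected number of maximal runs of exactly k of that bit is p^n when k = n, and 2 p^k q + (n - k - 1) p^k q^2 when k < n (two end positions plus the interior ones). The function name expected_run_freqs is hypothetical, and the uniform line length of 8 is only a stand-in, since the real line-length distribution is not given here.

```python
from collections import Counter

def expected_run_freqs(line_lengths, p1):
    """Expected relative frequency of each maximal run for independent bits."""
    prob = {"0": 1.0 - p1, "1": p1}
    expected = Counter()
    for n in line_lengths:
        for bit, p in prob.items():
            q = 1.0 - p
            for k in range(1, n + 1):
                if k == n:
                    e = p ** n  # the whole line is a single run
                else:
                    # run touching either end of the line, plus interior runs
                    e = 2 * p**k * q + (n - k - 1) * p**k * q * q
                expected[bit * k] += e
    total = sum(expected.values())
    return {run: e / total for run, e in expected.items()}

# Stand-in input: 100 lines of 8 words each (an assumption, not the real data),
# with the probability of a "1" taken from the message.
freqs = expected_run_freqs([8] * 100, p1=0.510)
for run in sorted(freqs, key=lambda r: (r[0], len(r))):
    print(f"{freqs[run]:.5f} {run}")
```

Feeding in the actual list of line lengths instead of the stand-in would give figures directly comparable with the "exp.fr." columns.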
As you can see, the observed distribution of run lengths is quite
close to that of random text. I.e., contrary to my claims, there is NO
significant correlation between the gallows bits of consecutive words.
The run-length statistics for individual sections show the same story.
I tried excluding short lines, and excluding the first and last run of
each line; still no correlation.
Sigh. It seems that I was stupidly fooled by a banal optical illusion.
Looking at the bit strings, the long runs of 0's and 1's are more
conspicuous than the short runs, and thus seem to be anomalously
common; but the cold statistics show that it ain't so.
So, my apologies to everyone for the false claim, and for the
(probably) irrelevant postings about Turkish linguistics. I will try
to be more careful in the future...
(There is one bright side to it, though --- the Chinese theory
is not dead after all!)
All the best,
--stolfi