Re: Curious coincidence
> [Jim Reeds:] Stolfi, I am struck by the almost exact 50-50%
> split into 0g and 1g words. Would it be possible to divide the
> text into a few large portions ("A" vs "B", say, or "Bio" vs
> "Non bio" or "front" vs "back") and see if the phenomenon holds
> in the pieces, as well?
Er, um, I just did it, and the results are a bit disappointing:
                                            |||| ||||
                                            VVVV VVVV
+ ----- + ------------------------------- + -------------------------- +
|       |             counts              |         fractions          |
+ ----- + ----- + ----------------------- + ------------------- + ---- +
|       |       |     gallows in word     |   gallows in word   |      |
| sec   |   tot |     0     1     2     ? |    0    1    2    ? |   SD |
+ ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
| str.2 | 10768 |  4610  5402    98   658 | .428 .502 .009 .061 | .005 |
| hea.1 |  6866 |  3565  3087    73   140 | .519 .450 .011 .020 | .006 |
| bio.1 |  6828 |  3343  3185    30   269 | .490 .466 .004 .039 | .006 |
| heb.1 |  2901 |  1284  1504    39    74 | .443 .518 .013 .026 | .009 |
| cos.2 |  1491 |   690   655    19   127 | .463 .439 .013 .085 | .013 |
| pha.2 |  1426 |   703   600     9   114 | .493 .421 .006 .080 | .013 |
| zod.1 |  1010 |   351   338    12   308 | .348 .335 .012 .305 | .016 |
| pha.1 |   926 |   527   327     4    68 | .569 .353 .004 .073 | .016 |
| cos.3 |   884 |   397   319     1   167 | .449 .361 .001 .189 | .017 |
| hea.2 |   868 |   425   389    12    42 | .490 .448 .014 .048 | .017 |
| str.1 |   755 |   322   348     3    82 | .426 .461 .004 .109 | .018 |
| heb.2 |   557 |   212   296     2    47 | .381 .531 .004 .084 | .021 |
| unk.6 |   489 |   186   240     6    57 | .380 .491 .012 .117 | .023 |
| unk.7 |   387 |   150   205     2    30 | .388 .530 .005 .078 | .025 |
| unk.5 |   342 |   148   159     2    33 | .433 .465 .006 .096 | .027 |
| unk.4 |   302 |   135   157     4     6 | .447 .520 .013 .020 | .029 |
| unk.1 |   213 |   106    95     1    11 | .498 .446 .005 .052 | .034 |
| cos.1 |   185 |   119    49     .    17 | .643 .265    . .092 | .037 |
| unk.2 |   140 |    74    57     5     4 | .529 .407 .036 .029 | .042 |
| unk.3 |    47 |    16    27     1     3 | .340 .574 .021 .064 | .073 |
+ ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
| txt.n | 37385 | 17363 17439   323  2257 | .464 .466 .009 .060 | .003 |
+ ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
| lab.n |  1154 |   386   590    29   149 | .334 .511 .025 .129 | .015 |
+ ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
The "?" column counts tokens that were rejected because they contained weirdos,
unreadable characters, or characters without a majority reading.
Each row is basically a section, except that the herbal pages were first
split according to language, and then some sections (including herbal) were
split into contiguous blocks. See the page list below.
The row "txt.n" is the concatenation of all the sections. The row "lab.n"
is the list of all labels (which were not included in any of the preceding
rows).
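The "SD" column is the standard deviation of the sampling error for the
fractions, sqrt(1/(4*N)), where N is the "tot" count of the row. This is the
binomial standard deviation sqrt(p*(1-p)/N) evaluated at p = 1/2.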
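For anyone who wants to check the "SD" column, here is a minimal Python sketch of the formula. The function name sampling_sd and the choice of rows are mine, purely for illustration; the totals are copied from the "tot" column of the table.

```python
import math

def sampling_sd(n_tokens: int) -> float:
    """Standard deviation sqrt(1/(4*N)) of a fraction estimated from N tokens.

    This is the binomial value sqrt(p*(1-p)/N) at p = 1/2, i.e. the largest
    it can be for any p.
    """
    return math.sqrt(1.0 / (4.0 * n_tokens))

# (section, total token count) taken from the "tot" column of the table
totals = {"str.2": 10768, "hea.1": 6866, "unk.3": 47, "txt.n": 37385}

for sec, n in totals.items():
    print(f"{sec}  N={n:5d}  SD={sampling_sd(n):.3f}")
```

Rounded to three decimals, these reproduce the tabulated values for the four rows shown (.005, .006, .073, .003).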
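The splits for individual sections are not too far from 50-50, which may
still be a hint of something. However, the differences between the 0-gallows
and 1-gallows fractions are now clearly statistically significant in several
sections: in str.2, for example, they are .428 vs .502, with an SD of only .005.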
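One way to put a number on this, offered only as a sketch and not necessarily the test used for this post: look only at tokens with 0 or 1 gallows and ask how far the 0-vs-1 split is from an exact 50-50 split, using the normal approximation to the binomial. The function name even_split_z is hypothetical; the counts are copied from the table.

```python
import math

def even_split_z(n0: int, n1: int) -> float:
    """z-score of a 0-vs-1 gallows split against an exact 50-50 split.

    Only tokens with 0 or 1 gallows are considered. Under the null
    hypothesis n0 ~ Binomial(n0 + n1, 1/2), so n0 - n1 has mean 0 and
    variance n0 + n1.
    """
    return (n0 - n1) / math.sqrt(n0 + n1)

# (count of 0-gallows words, count of 1-gallows words) from the table
counts = {
    "str.2": (4610, 5402),
    "hea.1": (3565, 3087),
    "txt.n": (17363, 17439),
}

for sec, (n0, n1) in counts.items():
    print(f"{sec}  z = {even_split_z(n0, n1):+.1f}")
```

Under these assumptions str.2 and hea.1 come out several standard deviations from an even split, in opposite directions, while the total "txt.n" is well within one standard deviation.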
I must admit the almost-even split for the total counts now looks more like
a meaningless coincidence.
All the best, I guess... 8-(
--stolfi
----------------------------------------------------------------------
Pages in each section:
bio.1
  f75r f75v f76r f76v f77r f77v f78r f78v f79r f79v
  f80r f80v f81r f81v f82r f82v f83r f83v f84r f84v
cos.1
  f57v
cos.2
  f67r1 f67r2 f67v2 f67v1 f68r1 f68r2 f68r3 f68v3
  f68v2 f68v1 f69r f69v f70r1 f70r2
cos.3
  f85r2 f86v4 f85v2 f86v3
hea.1
  f1v f2r f2v f3r f3v f4r f4v f5r f5v f6r f6v f7r
  f7v f8r f8v f9r f9v f10r f10v f11r f11v f13r
  f13v f14r f14v f15r f15v f16r f16v f17r f17v
  f18r f18v f19r f19v f20r f20v f21r f21v f22r
  f22v f23r f23v f24r f24v f25r f25v f27r f27v
  f28r f28v f29r f29v f30r f30v f32r f32v f35r
  f35v f36r f36v f37r f37v f38r f38v f42r f42v
  f44r f44v f45r f45v f47r f47v f49r f51r f51v
  f52r f52v f53r f53v f54r f54v f56r f56v
hea.2
  f87r f87v f90r1 f90r2 f90v2 f90v1 f93r f93v
  f96r f96v
heb.1
  f26r f26v f31r f31v f33r f33v f34r f34v f39r
  f39v f40r f40v f41r f41v f43r f43v f46r f46v
  f48r f48v f50r f50v f55r f55v f57r f66v
heb.2
  f94r f94v f95r1 f95r2 f95v2 f95v1
pha.1
  f88r f88v f89r1 f89r2 f89v2 f89v1
pha.2
  f99r f99v f100r f100v f101r1 f101v2 f102r1 f102r2
  f102v2 f102v1
str.1
  f58r f58v
str.2
  f103r f103v f104r f104v f105r f105v f106r f106v
  f107r f107v f108r f108v f111r f111v f112r f112v
  f113r f113v f114r f114v f115r f115v f116r
unk.1
  f1r
unk.2
  f49v
unk.3
  f65r f65v
unk.4
  f66r
unk.5
  f85r1
unk.6
  f86v6
unk.7
  f86v5
unk.8
  f116v
zod.1
  f70v2 f70v1 f71r f71v f72r1 f72r2 f72r3 f72v3
  f72v2 f72v1 f73r f73v