Re: Curious coincidence



    > [Jim Reeds:] Stolfi, I am struck by the almost exact 50-50%
    > split into 0g and 1g words. Would it be possible to divide the
    > text into a few large portions ("A" vs "B", say, or "Bio" vs
    > "Non bio" or "front" vs "back") and see if the phenomenon holds
    > in the pieces, as well?
    
Er, um, I just did it, and the results are a bit disappointing:    

                                              |||| ||||
                                              VVVV VVVV
                                              
  + ----- + ------------------------------- + -------------------------- +
  |       |              counts             |        fractions           |
  + ----- + ----- + ----------------------- + ------------------- + ---- + 
  |       |       |    gallows in word      |    gallows in word  |      |
  | sec   |   tot |     0     1     2     ? |    0    1    2    ? |   SD |
  + ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
  | str.2 | 10768 |  4610  5402    98   658 | .428 .502 .009 .061 | .005 |
  | hea.1 |  6866 |  3565  3087    73   140 | .519 .450 .011 .020 | .006 |
  | bio.1 |  6828 |  3343  3185    30   269 | .490 .466 .004 .039 | .006 |
  | heb.1 |  2901 |  1284  1504    39    74 | .443 .518 .013 .026 | .009 |
  | cos.2 |  1491 |   690   655    19   127 | .463 .439 .013 .085 | .013 |
  | pha.2 |  1426 |   703   600     9   114 | .493 .421 .006 .080 | .013 |
  | zod.1 |  1010 |   351   338    12   308 | .348 .335 .012 .305 | .016 |
  | pha.1 |   926 |   527   327     4    68 | .569 .353 .004 .073 | .016 |
  | cos.3 |   884 |   397   319     1   167 | .449 .361 .001 .189 | .017 |
  | hea.2 |   868 |   425   389    12    42 | .490 .448 .014 .048 | .017 |
  | str.1 |   755 |   322   348     3    82 | .426 .461 .004 .109 | .018 |
  | heb.2 |   557 |   212   296     2    47 | .381 .531 .004 .084 | .021 |
  | unk.6 |   489 |   186   240     6    57 | .380 .491 .012 .117 | .023 |
  | unk.7 |   387 |   150   205     2    30 | .388 .530 .005 .078 | .025 |
  | unk.5 |   342 |   148   159     2    33 | .433 .465 .006 .096 | .027 |
  | unk.4 |   302 |   135   157     4     6 | .447 .520 .013 .020 | .029 |
  | unk.1 |   213 |   106    95     1    11 | .498 .446 .005 .052 | .034 |
  | cos.1 |   185 |   119    49     .    17 | .643 .265 .    .092 | .037 |
  | unk.2 |   140 |    74    57     5     4 | .529 .407 .036 .029 | .042 |
  | unk.3 |    47 |    16    27     1     3 | .340 .574 .021 .064 | .073 |
  + ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
  | txt.n | 37385 | 17363 17439   323  2257 | .464 .466 .009 .060 | .003 |
  + ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +
  | lab.n |  1154 |   386   590    29   149 | .334 .511 .025 .129 | .015 |
  + ----- + ----- + ----- ----- ----- ----- + ---- ---- ---- ---- + ---- +

The "?" column counts tokens that were rejected because they contained weirdos,
unreadable characters, or characters without a majority reading.

Each line is basically a section, except that the herbal pages were first
split according to language, and then some sections (including herbal) were
split into contiguous blocks. See the page list below.

The line "txt.n" is the concatenation of all the sections. The line "lab.n"
is the list of all labels (which were not included in any of the preceding
lines).

The "SD" column is the standard deviation of the sampling error for the
fractions, sqrt(1/(4*N)).
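
As a quick check of the SD column, the formula can be evaluated directly.
This is only a minimal sketch: the function name sampling_sd is made up
for illustration, and it assumes the formula is the usual binomial standard
error sqrt(p*(1-p)/N) evaluated at p = 1/2, with N taken from the "tot"
column.

```python
import math

def sampling_sd(n, p=0.5):
    """SD of an observed fraction over n tokens, assuming each token
    independently falls in the class with probability p.
    With p = 0.5 this reduces to sqrt(1/(4*n))."""
    return math.sqrt(p * (1.0 - p) / n)

# Totals copied from the "tot" column of the table above.
totals = [("str.2", 10768), ("bio.1", 6828), ("unk.3", 47), ("txt.n", 37385)]

for sec, tot in totals:
    print(f"{sec}  {tot:6d}  {sampling_sd(tot):.3f}")
```

Rounded to three decimals, these agree with the SD entries in the
corresponding table rows (.005, .006, .073, .003).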

The splits for individual sections are not too far from 50-50, which may
still be a hint of something. However, the deviations from an even split
are clearly statistically significant: in the larger sections the 0-gallows
and 1-gallows fractions differ by several times the SD.
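
One rough way to put a number on this is a z-score for each section's
0g/1g split against an even split. The sketch below is an illustration
under stated assumptions, not necessarily the test behind the remark above:
it sets aside the "2" and "?" columns, treats each remaining word as an
independent coin flip, and uses the hypothetical helper name split_z.

```python
import math

def split_z(n0, n1):
    """z-score of the 0-gallows / 1-gallows split against 50-50.
    Under the null hypothesis the fraction n1/(n0+n1) has mean 0.5
    and standard deviation sqrt(1/(4*n)), with n = n0 + n1."""
    n = n0 + n1
    frac1 = n1 / n
    sd = math.sqrt(1.0 / (4.0 * n))
    return (frac1 - 0.5) / sd

# (0-gallows count, 1-gallows count) copied from the table above.
counts = {
    "str.2": (4610, 5402),
    "hea.1": (3565, 3087),
    "bio.1": (3343, 3185),
    "txt.n": (17363, 17439),
}

for sec, (n0, n1) in counts.items():
    print(f"{sec}  z = {split_z(n0, n1):+.1f}")
```

Under these assumptions str.2 and hea.1 come out at roughly +8 and -6
standard deviations, in opposite directions, while the whole-text line
txt.n stays well under 1.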

I must admit the almost-even split for the total counts now looks more like
a meaningless coincidence.

All the best, I guess... 8-(

--stolfi

----------------------------------------------------------------------
Pages in each section:

  bio.1
    f75r f75v f76r f76v f77r f77v f78r f78v f79r f79v
    f80r f80v f81r f81v f82r f82v f83r f83v f84r f84v

  cos.1
    f57v

  cos.2
    f67r1 f67r2 f67v2 f67v1 f68r1 f68r2 f68r3 f68v3
    f68v2 f68v1 f69r f69v f70r1 f70r2

  cos.3
    f85r2 f86v4 f85v2 f86v3

  hea.1
    f1v f2r f2v f3r f3v f4r f4v f5r f5v f6r f6v f7r
    f7v f8r f8v f9r f9v f10r f10v f11r f11v f13r
    f13v f14r f14v f15r f15v f16r f16v f17r f17v
    f18r f18v f19r f19v f20r f20v f21r f21v f22r
    f22v f23r f23v f24r f24v f25r f25v f27r f27v
    f28r f28v f29r f29v f30r f30v f32r f32v f35r
    f35v f36r f36v f37r f37v f38r f38v f42r f42v
    f44r f44v f45r f45v f47r f47v f49r f51r f51v
    f52r f52v f53r f53v f54r f54v f56r f56v

  hea.2
    f87r f87v f90r1 f90r2 f90v2 f90v1 f93r f93v
    f96r f96v

  heb.1
    f26r f26v f31r f31v f33r f33v f34r f34v f39r
    f39v f40r f40v f41r f41v f43r f43v f46r f46v
    f48r f48v f50r f50v f55r f55v f57r f66v

  heb.2
    f94r f94v f95r1 f95r2 f95v2 f95v1

  pha.1
    f88r f88v f89r1 f89r2 f89v2 f89v1

  pha.2
    f99r f99v f100r f100v f101r1 f101v2 f102r1 f102r2
    f102v2 f102v1

  str.1
    f58r f58v

  str.2
    f103r f103v f104r f104v f105r f105v f106r f106v
    f107r f107v f108r f108v f111r f111v f112r f112v
    f113r f113v f114r f114v f115r f115v f116r

  unk.1
    f1r

  unk.2
    f49v

  unk.3
    f65r f65v

  unk.4
    f66r

  unk.5
    f85r1

  unk.6
    f86v6

  unk.7
    f86v5

  unk.8
    f116v

  zod.1
    f70v2 f70v1 f71r f71v f72r1 f72r2 f72r3 f72v3
    f72v2 f72v1 f73r f73v


    
    > 
    > -- 
    > Jim Reeds, AT&T Labs - Research
    > Shannon Laboratory, Room C229, Building 103
    > 180 Park Avenue, Florham Park, NJ 07932-0971, USA
    > 
    > reeds@xxxxxxxxxxxxxxxx, phone: +1 973 360 8414, fax: +1 973 360 8178
    > 
    >