
Re: Curious coincidence



    > [Stolfi:] Well, the variance of a 0-1 coin toss is 1/2, right? 
    
Wrong! It's 1/4, you dumbo!

    > [Gabriel:] SD makes sense only if the distribution is normal...
    > but coin tossing is not a gaussian process.
    
The variance (and hence the standard deviation) is defined for any
distribution. (OK, sometimes the formula yields +oo, but not in this case.)

If two variables are independent, the variance of their sum is the sum
of their variances.  So the sum of N independent variables with 
the same variance V has variance N*V.  This too holds for any distribution.
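A quick numeric check of both facts (a minimal sketch; the Bernoulli
variance formula p*(1-p) is the standard one):

```python
# Variance of a fair 0-1 coin: E[X] = 1/2 and E[X^2] = 1/2,
# so Var(X) = E[X^2] - E[X]^2 = 1/2 - 1/4 = 1/4.
p = 0.5
var_one = p * (1 - p)        # Bernoulli(p) variance, = 1/4 for p = 1/2
print(var_one)               # 0.25

# The sum of N independent copies has variance N times that:
N = 100
var_sum = N * var_one
print(var_sum)               # 25.0
```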

There is a formula (Chebyshev's) that gives an X% confidence interval
for any distribution, given its mean and variance. Unfortunately,
those intervals get quite broad as X approaches 100%.  However, the sum
of a large enough number of independent variables with the same
distribution will be indistinguishable from a Gaussian --- for which 
one can compute much tighter confidence intervals, even for X near 100%.
(That is one of the many reasons why statisticians are fond of Gaussians 8-)
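To see just how much broader the distribution-free intervals are, here
is a small comparison (a sketch; the Gaussian multipliers 1.96, 2.58,
3.29 are the usual two-sided normal quantiles, quoted from tables):

```python
import math

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, for ANY distribution.
# For an X% confidence interval, solve 1/k^2 = 1 - X  ->  k = sqrt(1/(1-X)).
for conf, gauss_k in [(0.95, 1.96), (0.99, 2.58), (0.999, 3.29)]:
    cheb_k = math.sqrt(1.0 / (1.0 - conf))
    print(f"{conf:6.3f}: Chebyshev +-{cheb_k:5.2f} SD  vs  Gaussian +-{gauss_k:.2f} SD")
```

At 95% the Chebyshev interval is already more than twice as wide
(+-4.47 SD vs +-1.96 SD), and the gap grows rapidly as X approaches 100%.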
    
    > What is that calculation trying to explain?

A proposed explanation for the almost 50-50 split was that the
"gallows bit" of each word was an independent, uniform 0-1 random
variable. The count of words with gallows would then be the sum
of N such variables, whose expected value is N/2.

But, of course, the sum is itself a random variable, so we cannot
expect the count to be exactly N/2. Rene wondered whether the observed
count was *too* close to N/2 to be the result of adding N coin
tosses. That would weaken the random-bit theory by making alternative
explanations more likely --- e.g., that the number of gallows was
intentionally adjusted for a 50/50 split, or that the gallows were
generated by a (partially?) deterministic rule that ensured the even
split.

My SD calculation was meant to show that the observed deviation from N/2
(~40 tokens) is roughly in the ballpark of the deviation predicted for
the sum of N = 34806 independent random bits. The (correct this time, I hope)
variance of the latter is (1/4)N, so the standard deviation is
sqrt(34806/4) ~ 93. So the observed deviation from equality is in fact 
on the small side, but not *too* small.
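The arithmetic, spelled out (N as given above):

```python
import math

N = 34806                 # number of tokens counted
variance = N / 4          # variance of the sum of N independent fair bits
sd = math.sqrt(variance)  # predicted SD of the gallows count around N/2
print(round(sd))          # 93
```

So an observed deviation of ~40 tokens is well within half an SD of the
coin-toss prediction.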

    > [Jim Reeds:] But I share Gabriel's confusion about what the
    > calculation is trying to explain. I took the given 2 by 2 count
    > data, (8772, 9016; 8591, 8423) and worked out two chi-squared
    > test statistics for it. ... the 4 counts are obviously unequal,
    > and seem to look inhomogeneous.

Thanks for the analysis; but now it is my turn to be confused.  

I gather that the goal of the chi-square test is to answer the
question: "Can we explain the discrepancies we see between those
counts as the result of sampling error, or can we confidently say that
the expected value of count X is greater than that of Y?"

Indeed, the 2x2 table entries show sizable deviations from the uniform
4-way split, especially between the top and bottom rows, and more so
in the right column.
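For what it's worth, here is an independence test recomputed from the
quoted counts (a sketch; the assignment of the four counts to the
gallows x table cells is my reading of the quote, and Jim may well have
computed his two statistics differently):

```python
# 2x2 table from Jim Reeds's message: (8772, 9016; 8591, 8423).
a, b, c, d = 8772, 9016, 8591, 8423
n = a + b + c + d

# Shortcut chi-squared statistic for independence in a 2x2 table (1 df):
#   chi2 = n * (a*d - b*c)^2 / (row1 * row2 * col1 * col2)
chi2 = n * (a*d - b*c)**2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 2))   # ~ 4.8, above the 5% critical value 3.84 for 1 df
```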

I don't question your claim that the deviations are statistically
significant. This only means that the "table letters" bit *as computed*
definitely has a small bias towards 0 (and, I presume, also a small
negative correlation with the gallows bit). But, in my view, this
bias doesn't mean that the numbers are "uninteresting".

For one thing, we know that the data is noisy, and that the errors are
*not* fairly distributed. While it is hard to mistake a gallows for
anything else, it is easy to misread a "ch" or "sh" as some other
character combination. Moreover, it is conceivable that words that
contain tables are harder to read, and therefore are more likely to be
rejected for not reaching majority agreement. Finally, I may be using
a slightly incorrect definition of "table letter". (You may recall the
puzzling absence of "e" after "p" and "f", and my conjecture about
hooked arms on those letters.) All these errors are likely to produce
a systematic downward bias in the "table letter" count.

So I think that the chi-square tests don't quite answer the question
of whether the gallows/table splits are "surprisingly even" or not.

To give an analogy: suppose we tabulate the crosses of two pea plants
according to pink/white flowers and smooth/wrinkled seeds ;-); and we
get numbers like my 2x2 table, with a small *but statistically
significant* deviation from equality.  

Should we say "They are just different numbers, so what"? Or are we
still entitled to speculate about an underlying *symmetrical* random
choice process, plus slight external disturbances?

All the best,

--stolfi