Re: VMs: VMS: entropy of my plaintext data for comparison purposes
22/12/2003 12:06:39 AM, Dennis <tsalagi@xxxxxxxx> wrote:
> Hi, Jacques,
>
> I think that h1-h2 may be the important statistic,
>since that tends to take out the size of the character
>set:
Sorry, a few seconds ago I pressed "send" without
having written anything. Dumb frog.
I reduced both Mario and Cesar to a 20-letter alphabet
by using digraphs (so their h0 are identical), so the
absolute values should be significant for these two.
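As a sketch of what these order-n entropies usually mean in this kind of analysis (my own assumption — the post doesn't show the computation): h0 as log2 of the alphabet size actually observed, h1 as the unigram Shannon entropy, and h2 as the conditional entropy of a letter given its predecessor, estimated from digraph counts:

```python
import math
from collections import Counter

def entropies(text):
    """Return (h0, h1, h2) in bits per letter.

    h0: log2 of the number of distinct letters seen
    h1: Shannon entropy of single-letter frequencies
    h2: conditional entropy H(next letter | previous letter),
        computed as H(digraph) - H(first letter of digraph)
    """
    n = len(text)
    unigrams = Counter(text)
    h0 = math.log2(len(unigrams))
    h1 = -sum((c / n) * math.log2(c / n) for c in unigrams.values())

    digraphs = Counter(zip(text, text[1:]))
    m = n - 1  # number of digraphs
    h_pair = -sum((c / m) * math.log2(c / m) for c in digraphs.values())
    firsts = Counter(text[:-1])
    h_first = -sum((c / m) * math.log2(c / m) for c in firsts.values())
    h2 = h_pair - h_first
    return h0, h1, h2
```

On a perfectly predictable text like "ababab", h2 comes out as 0 (each letter determines the next), while h0 = h1 = 1 bit; a Voynich-like text shows an unusually large h1-h2 gap.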
I have a feeling that h1-h2 might also have to do
with the frequency of digraphs. I must try the
phi statistic. It should be a better measure
of entropy. Consider this: if the text is
completely random, then the expected frequencies
in the tables such as I posted should be equal
to the observed frequencies. So chi2 = 0 and
therefore phi = 0 too. The length of the text
should not affect this statistic, because,
when calculating chi2, you must ignore cells
with expected frequencies < 5. If your text
is too short, then you'll have no cells >= 5,
and the value of chi2 will be undefined.
Having calculated chi2, you can estimate
the probability that the value obtained is
significantly different from zero. Phi, on
the other hand, gives you the size of the
difference from randomness. So, phi tells you
how far from random your text is, and chi2
how certain you can be of that. This strikes
me as potentially a much, much better measure
of entropy.
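Here is a rough sketch of the test described above, assuming "phi" means the phi coefficient sqrt(chi2/N) over the digraph contingency table, with expected counts taken from the letter marginals under independence (the post doesn't give the exact formulas, so treat the details as my reconstruction):

```python
import math
from collections import Counter

def chi2_phi(text, min_expected=5):
    """Chi-square of the observed digraph table against independence
    of adjacent letters, skipping cells with expected count below
    min_expected; phi = sqrt(chi2 / N) gives the size of the
    departure from randomness.  Returns (None, None) when the text
    is too short for any cell to qualify, i.e. chi2 is undefined.
    """
    pairs = list(zip(text, text[1:]))
    n = len(pairs)
    obs = Counter(pairs)
    row = Counter(a for a, b in pairs)  # first-letter marginals
    col = Counter(b for a, b in pairs)  # second-letter marginals

    chi2 = 0.0
    used = 0
    for a in row:
        for b in col:
            exp = row[a] * col[b] / n
            if exp < min_expected:
                continue  # ignore sparse cells, as described above
            used += 1
            o = obs.get((a, b), 0)
            chi2 += (o - exp) ** 2 / exp
    if used == 0:
        return None, None
    return chi2, math.sqrt(chi2 / n)
```

For a random text both values hover near zero; for a rigidly patterned one like "abab..." the chi2 is huge and phi approaches 1, which is the "how far from random" reading above.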
Whatever comes out of all this, it looks
like "my language", even unencrypted, with
all its 31 letters, is much more Voynich-like
than English.
But this has been keeping me from finishing my
Easter Island article... :-( damn it!