
VMs: meaning of entropy



Rene Zandbergen wrote:

> When running the entropy calculations, you should
> find almost identical values for (h0,) h1 and h2,
> but a high value for h3 (which can hardly be
> calculated reliably anyway).

As entropy has been discussed so much on the list recently,
I returned to my attempt to understand its various
terms. Last time I tried (in February), I think
I managed to get an idea of what entropy is and how it
is used for text analysis. What still baffles me are
those orders Rene mentions and which Monkey calculates
up to the 120th.

Am I correct in assuming that "h1" is the "first-order"
entropy, i.e. the predictability of the next character when
the preceding one is known (and the same for words)?

Now, "h2" is the same calculated for pairs of characters,
"h3" for triplets, etc. Is that correct?

But Rene says: "Character-pair entropy is sometimes called 
second-order entropy, while the conditional single-character 
entropy is also sometimes called second-order entropy."
I do not remember this distinction being mentioned
in list discussions - so which of the two does the
Monkey terminology follow?
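
If the textbooks I have been reading are right, the two notions Rene
contrasts are tied together by the chain rule
H(X1,X2) = H(X1) + H(X2|X1), so one can get the conditional version
from the block version by subtraction (reusing block_entropy and
sample from the sketch above; again, this is only my guess at the
definitions, not necessarily Monkey's):

    h2_pair = block_entropy(sample, 2)            # entropy of character pairs
    h2_cond = h2_pair - block_entropy(sample, 1)  # conditional entropy
    # h2_cond estimates H(next char | previous char), via the chain
    # rule: H(X1, X2) = H(X1) + H(X2 | X1)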

Finally, what is the meaning of "h0"? I know it is
"the base-2 logarithm of the number of different words 
(or characters) found" (Jacques in Monkey.doc) - but what
do the calculated values say about the text?
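
My tentative reading: since log2(N) is the entropy of N equally
likely symbols, h0 is the value h1 would take if every character (or
word) in the text were used equally often. So h0 is an upper bound on
h1, and the gap between them says how unevenly the alphabet (or
vocabulary) is used. A one-liner, under that assumption:

    import math

    def h0(text):
        # log2 of the number of distinct characters seen in the text;
        # e.g. a text using all 26 Latin letters gives log2(26),
        # about 4.70 bits.
        return math.log2(len(set(text)))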

I have just located a fairly new set of text-analysis
programs by Dmitry V. Khmelev (Toronto Univ.), which
include "cross-entropy" between texts and some other
interesting concepts that might be helpful for VMS stats
(if only I could grasp the basics):

http://www.math.toronto.edu/dkhmelev/PROGS/tacu/index-eng.html
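
I have not studied his programs yet, so the following is only my
guess at what "cross-entropy between texts" might mean: estimate a
character model from one text and measure the average surprise of
another text under it. A unigram sketch (the add-one smoothing is my
own simplification; Khmelev surely uses something more refined):

    import math
    from collections import Counter

    def cross_entropy(model_text, eval_text):
        # Average bits per character of eval_text under the character
        # frequencies of model_text, with add-one smoothing so
        # characters unseen in model_text do not give infinite surprise.
        counts = Counter(model_text)
        alphabet = set(model_text) | set(eval_text)
        total = len(model_text) + len(alphabet)
        return -sum(math.log2((counts.get(ch, 0) + 1) / total)
                    for ch in eval_text) / len(eval_text)

If two texts are statistically similar, this number should come out
close to the h1 of the evaluated text; very different texts push it
higher.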

Best regards,

Rafal