
Re: VMs: learning



Hi Rafal,

Sorry for taking advantage of the list to learn something
about entropy. I have now loaded the Java calculator
with the full text of the Kalevala (a long epic poem in Finnish)
and with Galileo's Latin text of Sidereus Nuncius
(with some numbers) - both converted to lower case.
The results for character 1-3 entropies are these:

Kalevala (500 Kb)    (1) 4.33 (2) 7.66 (3) 10.50
---> Monkey (32 Kb)      3.85     3.25      3.11

Sidereus Nuncius
by Galileo (72 Kb)   (1) 4.23 (2) 7.57 (3) 10.22
---> Monkey (25 Kb)      4.00     3.21      2.52

It seems to me, therefore, that the Java program (let's call
it JEC) shows the unconditional entropy of pairs, triplets, etc.
The difference for the 1st order is probably due to
(a) the size of the sample, (b) JEC recognizing 256 characters,
and (c) the filtering of some non-literals by both programs,
which I haven't unified yet.

Does this make sense?

Perfectly - and it looks like Monkey is correct in this instance. :-)


FWIW, entropy is the statistical balance between predictability and unpredictability in a given text when using a particular model - once you've factored out all the predictability, entropy is the (unpredictable) residue.

Technically speaking, Monkey appears to be outputting the mean average expressibility (in bits) of the input stream, which is an indirect measure of entropy - entropy should really be expressed on a 0.0 (perfectly predictable) to 1.0 (perfectly unpredictable) range (using a percentage is quite acceptable). However, as you have to normalise out the size of the alphabet, this tends not to be so useful in practice... I'll explain.

        (Entropy = 0.0) ==> perfectly predictable
        (Entropy = 1.0) ==> perfectly unpredictable

For example: the 0th-order (ie, context-less) entropy of a set of 1000 completely random numbers in the range (0..255) should be 1.0 (as it is perfectly unpredictable), but the mean average expressibility of the stream would be 8.0 bits per unit. This latter figure is what Monkey calls "h1" (if I'm not mistaken).

By way of contrast, "h0" (as described) should equal log2(<number of choices per element>), which here equals 8.0 (because <2 to the power of 8.0> == 256). This is (AIUI) what you need to divide the first value by in order to calculate the entropy.

Therefore, (and I hope you're following me so far):-

0th-order entropy = h1 / h0
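
To make that concrete, here's a minimal Python sketch of the h0 / h1 / entropy arithmetic as I've described it - the function names are my own invention, and this is emphatically not Monkey's or JEC's actual code, just an illustration:

    import math
    import random
    from collections import Counter

    def h1_bits_per_symbol(stream):
        """Mean expressibility in bits per symbol - what Monkey calls h1, AIUI."""
        counts = Counter(stream)
        total = len(stream)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def h0_bits(alphabet_size):
        """log2 of the number of choices per element."""
        return math.log2(alphabet_size)

    # 1000 "perfectly unpredictable" bytes: h1 comes out a shade under 8.0
    # (sampling noise), h0 is exactly 8.0, so the normalised 0th-order
    # entropy h1 / h0 comes out close to 1.0.
    stream = [random.randrange(256) for _ in range(1000)]
    h1 = h1_bits_per_symbol(stream)
    h0 = h0_bits(256)
    print(h1, h0, h1 / h0)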

The meaning of h2 isn't ~quite~ so simple (please don't run for the hills!), but I'll do my best... starting with entropy again (but moving on up to 1st order this time round).

The "1st-order entropy" of a sequence of numbers is the average predictability of each element in the sequence, given the additional knowledge of what the preceding element in the sequence was (note that we've gone from an unordered set to an ordered sequence here).

If we have (say) 256 contexts, then one way of tackling this problem is to decompose it into 256 separate 0th-order entropies (ie, one for each of the 1st-order contexts) and then calculate their weighted average. Et voilà - you've just calculated a 1st-order entropy (ie, Monkey's "h2").
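
In code terms, that decomposition looks something like the following - again a sketch of my own (in Python), not Monkey's actual implementation:

    import math
    from collections import Counter, defaultdict

    def first_order_entropy(text):
        """Average bits per symbol given the preceding symbol - Monkey's h2, AIUI."""
        followers = defaultdict(Counter)
        for prev, cur in zip(text, text[1:]):
            followers[prev][cur] += 1
        total_pairs = len(text) - 1
        h = 0.0
        for counts in followers.values():
            context_total = sum(counts.values())
            # 0th-order entropy within this one context...
            h_context = -sum((c / context_total) * math.log2(c / context_total)
                             for c in counts.values())
            # ...weighted by how often that context actually occurs.
            h += (context_total / total_pairs) * h_context
        return h

Note that only contexts which actually occur in the text contribute anything to the sum - which leads neatly into the memory point below.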

To calculate a 2nd-order entropy of an 8-bit stream, you'd need to cater for a possible 256 * 256 contexts - straightforward enough, but this is starting to get a bit memory-hungry if you tackle it in a simple-minded linear way.

What's happening here is that the number of *potential* contexts is exponentially increasing with order, but the number of *actual* contexts is fairly small (for typical texts) - so the (actual/potential) ratio is rapidly decreasing (ie, its "token density in context space"). FYI, a good software technique for handling this kind of sparse array problem is "associative arrays" (which is what Perl and JavaScript use) - given a five character context (ie, for 5th-order entropy calculation) "ABCDE" (say), then use that as an associative index into an array, ie context5["ABCDE"].
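
Here's a tiny sketch of that associative-array idea - Python's dictionaries play the same role as Perl's or JavaScript's associative arrays, and the names are once more my own:

    from collections import Counter, defaultdict

    def count_contexts(text, context_length):
        """Map each context string to a Counter of the characters that follow it."""
        contexts = defaultdict(Counter)   # e.g. contexts["ABCDE"] for a 5-char context
        for i in range(context_length, len(text)):
            contexts[text[i - context_length:i]][text[i]] += 1
        return contexts

    # Only the contexts that actually occur ever get allocated, so the storage
    # grows with the text itself rather than with 256 ** context_length.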

Cheers, .....Nick Pelling.....

PS: note that I haven't actually tested Monkey myself. :-)

PPS: there's one final twist which entropy programs can get subtly wrong: you have to remember that, for an nth-order entropy, you have undefined values for the first (n-1) characters. It's not wildly important, but it tripped me up once (many years ago), so I thought I ought to share that too. :-)
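
PPPS: in the count_contexts() sketch further up, that boundary condition is exactly why the counting loop starts at index context_length rather than 0. A two-line restatement of the same point (names mine again):

    # Only positions with a full-length context behind them contribute to an
    # nth-order count - so normalise by this (slightly smaller) number of
    # positions, not by the raw length of the text.
    def positions_with_full_context(text, context_length):
        return range(context_length, len(text))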

