
Re: VMs: learning



On Saturday 16 Aug 2003 11:15 pm, Nick Pelling wrote:
> Technically speaking, Monkey appears to be outputting the mean average
> expressibility (in bits) of the input stream, which is an indirect measure
> of entropy - entropy should really be expressed on a 0.0 (perfectly
> predictable) to a 1.0 (perfectly unpredictable) range (using a percentage
> is quite acceptable). However, as you have to normalise out the size of the
> alphabet, this tends to be not so useful in practice... I'll explain.
>
>          (Entropy = 0.0) ==> perfectly predictable
>          (Entropy = 1.0) ==> perfectly unpredictable

If you look back into the archives, I have wondered about this many times 
(expressing the entropy value as a percentage of the maximum entropy). The 
problem is that this normalised unpredictability depends on the number of 
different characters (the alphabet) in the source. Sources with large 
alphabets can be more unpredictable (in the sense that one has more choices) 
than those with smaller ones, and this is of course reflected in the value of 
entropy.
So if one "normalises" the value to the maximum that can be achieved with a 
particular alphabet size, the same string gets a different value depending on 
which alphabet is assumed, e.g. abbabaaabbbbabab in a system that is *known* 
to have 2 states compared to a system that has 100000 states.
In the second case, the source would *seem* to be more predictable, yet the 
message is the same.
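
A rough Python sketch of the effect, for illustration only. It is not how 
Monkey computes its figure; the function names are made up here, and it uses 
plain first-order (single-character) Shannon entropy divided by log2 of the 
assumed alphabet size.

```python
import math
from collections import Counter


def entropy_bits(message):
    """First-order Shannon entropy of the message, in bits per character."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalised_entropy(message, alphabet_size):
    """Entropy as a fraction of the maximum possible for the ASSUMED alphabet.

    alphabet_size is what we claim to know about the source, not what is
    observed in the message; it must be at least 2.
    """
    return entropy_bits(message) / math.log2(alphabet_size)


msg = "abbabaaabbbbabab"            # 7 a's and 9 b's
h = entropy_bits(msg)               # about 0.99 bits per character
print(h)
print(normalised_entropy(msg, 2))       # about 0.99: looks unpredictable
print(normalised_entropy(msg, 100000))  # about 0.06: looks predictable
```

The raw value in bits is the same in both cases; only the divisor changes.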

Have a look at the nice posting by Jim R. about "counting commas" in the evmt 
site.

> PPS: there's one final twist which entropy programs can get subtly wrong:
> you have to remember that, for an nth-order entropy, you have undefined
> values for the first (n-1) characters. It's not wildly important, but it
> tripped me up once (many years ago), so I thought I ought to share that
> too. :-)

Not only at the beginning, but also at the end.
The differences are trivial if the source is long. A common workaround is to 
"wrap around" the sequence, so the end and the start join.

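A small Python sketch of the wrap-around idea, under stated assumptions: 
"nth-order" is taken here to mean the entropy of overlapping blocks of n 
characters (other definitions exist, e.g. conditional entropy), and the 
function names are invented for the example. Without wrapping, a sequence of 
length N gives only N - n + 1 blocks; with wrapping, every position starts a 
block.

```python
import math
from collections import Counter


def ngram_counts(seq, n, wrap=False):
    """Count overlapping blocks of n characters (assumes 1 <= n <= len(seq)).

    With wrap=True the first n-1 characters are appended to the end, so the
    blocks that straddle the end/start join are counted too.
    """
    extended = seq + seq[:n - 1] if wrap else seq
    return Counter(extended[i:i + n] for i in range(len(extended) - n + 1))


def block_entropy_bits(seq, n, wrap=False):
    """Shannon entropy, in bits per block, of the n-character block counts."""
    counts = ngram_counts(seq, n, wrap)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


msg = "abbabaaabbbbabab"
print(sum(ngram_counts(msg, 3).values()))             # 14 blocks
print(sum(ngram_counts(msg, 3, wrap=True).values()))  # 16 blocks
print(block_entropy_bits(msg, 3), block_entropy_bits(msg, 3, wrap=True))
```

On a 16-character message the two extra blocks are noticeable; on a long 
source they hardly matter.
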
Cheers,

Gabriel
