
Re: State machine hypothesis...



Hi Gabriel,

> This is polyalphabetic (we have 27 substitutions), state driven (we have 27
> states), yet the output has a low entropy. No contradiction!


Does such a mechanism output a code which shares any other
similarities with the VMS text (other than the alleged low entropy)?

It certainly also gives a peaky distribution (like the VMS), rather than a normal language-like distribution. This is why the mechanism's output stream has a low entropy, and why some compressors use such a stage as a front-end coder: the low-entropy rank stream can then be compressed effectively using standard statistical coders.


But a 27-state sorted-by-rank polyalphabet set would frequently emit runs of the top-ranking symbol (i.e. rank#0, rank#0, rank#0, ...): which plainly doesn't happen in the VMS (the idea of coding [space] for repeated characters notwithstanding).
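For what it's worth, both effects can be seen with a toy move-to-front (MTF) coder — my stand-in for a sorted-by-rank mechanism, which may well differ in detail from Gabriel's 27-state machine: frequent symbols sit near the top of the table, so the rank stream comes out peaky and low-entropy, while any repeated character immediately emits rank 0.

```python
from collections import Counter
import math

def mtf_encode(text):
    """Move-to-front: emit each symbol's current rank in the table,
    then promote that symbol to rank 0."""
    table = sorted(set(text))            # initial table: sorted alphabet
    ranks = []
    for ch in text:
        r = table.index(ch)
        ranks.append(r)
        table.insert(0, table.pop(r))    # promote to the front
    return ranks

def entropy(seq):
    """Shannon entropy in bits per symbol."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

text = "abracadabra" * 50
ranks = mtf_encode(text)

# The rank stream is peakier (lower-entropy) than the raw text...
print(entropy(text), entropy(ranks))
# ...and any doubled character comes straight out as rank 0:
print(mtf_encode("bookkeeper"))   # the 'oo', 'kk', 'ee' pairs each emit a 0
```

(Whether a real 27-state machine shows the same rank-0 runs depends on its promotion rule, of course — this is just the simplest version.)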

All the same, my experience with these kinds of Shannon coders in recent years makes me think that a 10-state sorted-by-rank polyalphabet mechanism would probably produce data fairly close to the curves seen here.

Quick observation: I can see how using (space) to mean "same rank as last symbol" could account for the distribution seen in space-delimited word-lengths. Effectively, this would amount to a binary decision ("is this the same rank as the previous symbol?") made for each symbol processed with some roughly constant probability, which would give something like the word-length distribution as observed.
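Here's a quick sketch of that binary-decision model (the probability value is just a made-up placeholder, not anything fitted to the VMS): treat each symbol as matching the previous rank with fixed probability p, write every match as a space, and look at the resulting word lengths.

```python
import random
from collections import Counter

def word_lengths(n_symbols, p_same, seed=0):
    """Simulate a stream where each symbol matches the previous symbol's
    rank with probability p_same; every match is written out as a space,
    splitting the stream into 'words'. Returns word-length counts."""
    rng = random.Random(seed)
    counts, current = Counter(), 0
    for _ in range(n_symbols):
        if rng.random() < p_same and current > 0:
            counts[current] += 1      # emit a space: word boundary
            current = 0
        else:
            current += 1
    if current:
        counts[current] += 1          # trailing partial word
    return counts

counts = word_lengths(100_000, p_same=0.18)
# With a flat p the lengths come out geometric (short words most common);
# whether that matches the observed VMS curve would need checking against
# the actual word-length data.
print(sorted(counts.items())[:6])
```

A rank-dependent (rather than flat) probability would reshape the curve, which is presumably where the interesting tuning would happen.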

Can you give an example?

I'll try to hack together an encoder, and see where it leads.


Cheers, .....Nick Pelling.....