
Re: VMs: Declaration of WAR against EVA



Hi Petr,

At 07:07 06/03/03 +0100, Petr Kazil wrote:
On "Optimally segmenting the VMS into syllables/glyphs":

In my last mail I stated that there should be a direct algorithm to segment the VMS into groups of
characters in such a way that the nimber of different groups is minimized and mean group-length is
maximized.


> I remembered that there exists such an algorithm. It's called "Huffman Coding". It's been 20 years so I've forgotten the details, but the algorithm selects the optimal code for a data stream so that all redundancy is removed.

In simple (context-free) Huffman coding (which I've been using for 20 years; data compression is one of my main professional skills!), you take a stream of data, calculate the probability of each particular input token occurring in the stream, and then assign a variable bit-length output token to each one accordingly.


This way, the most frequent input tokens get translated into the shortest bit-length output tokens - and the rarest input tokens get translated into the longest bit-length output tokens.

The magical thing here is that (the size of the output file, in bits) / (the number of input tokens) will tend towards the (context-free) per-token entropy of the file. Which is quite cool. :-)
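
(To make that concrete, here's a minimal Python sketch of the idea - not production code, and the sample string is just a made-up EVA-ish stand-in: it builds Huffman code lengths from the token frequencies and compares the average code length against the per-token entropy.)

import heapq
from collections import Counter
from math import log2

def huffman_code_lengths(freqs):
    # Build Huffman code lengths (in bits) for a {token: count} table.
    heap = [(count, i, {tok: 0}) for i, (tok, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                         # degenerate one-symbol stream
        return {tok: 1 for tok in freqs}
    tie = len(heap)
    while len(heap) > 1:
        c1, _, depths1 = heapq.heappop(heap)   # take the two least frequent subtrees...
        c2, _, depths2 = heapq.heappop(heap)
        merged = {t: d + 1 for t, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))   # ...and merge them into one
        tie += 1
    return heap[0][2]

text = "qokeedyqokeeychedyqokaiin"             # made-up EVA-ish stand-in
freqs = Counter(text)
total = sum(freqs.values())
lengths = huffman_code_lengths(freqs)

entropy = -sum(c / total * log2(c / total) for c in freqs.values())
avg_bits = sum(freqs[t] * lengths[t] for t in freqs) / total
print("per-token entropy: %.3f bits" % entropy)
print("Huffman average:   %.3f bits" % avg_bits)   # always within 1 bit of the entropy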

Put another way: if you have an input stream of 1000 tokens, and it happens to contain 250 instances of the letter <e>, then there's a 25% chance that any letter you pick at random will be an <e>. The information content (entropy) of an <e> is therefore 25% = a 1 in 4 chance => 2 bits of information (a 1 in 8 chance => 3 bits, a 1 in 16 chance => 4 bits, etc... basically, powers of two).
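
(If you want to check that arithmetic, the information content of a token with probability p is just -log2(p), eg in Python:)

from math import log2

p_e = 250 / 1000         # 250 <e>s in a 1000-token stream
print(-log2(p_e))        # 2.0 bits (1 in 4); 1 in 8 gives 3.0, 1 in 16 gives 4.0, etc.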

What (I think) you're effectively asking is this: given that we know the probabilities of each EVA character occurring - and hence can calculate the information content (in bits) of each - can we use that knowledge to create an algorithm that somehow groups them into "super-tokens", whose entropic distribution better matches (say) European languages' distributions?

The answer is... sort of. I've created lots of bespoke data-compression algorithms on this kind of theme in the past, though usually where the target distribution to match is as flat as possible (ie so the output is as close to its entropic minimum size as possible). However, because grouping tokens into super-tokens changes the length of the file (ie after grouping tokens together, there are always fewer super-tokens in the file than there were tokens before), this changes all the stats along the way.

The algorithmic approach I normally take for these kinds of files is (there's a rough sketch of it in code below):
(a) calculate all likely candidate pairs that could be taken (ie, <qo>, <dy>, <ii> etc)
(b) pick whichever of these best satisfies the criteria I'm looking for
(c) replace all instances of that group in the text with a "supertoken"
(d) if I haven't yet matched my target distribution, repeat all over again.


Note that as the process repeats, once you have a supertoken for (say) <ii>, the likely candidate search might well produce <<ii>i> as a token.
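
(Here's that loop sketched in Python - definitely not my actual tooling: the merge criterion and the sample string are just placeholders, and step (b)'s real test would be "does the resulting distribution look more like the target?" rather than simply "take the most frequent pair" as I've done here.)

from collections import Counter

def merge_pairs(tokens, n_merges=5):
    tokens = list(tokens)
    for _ in range(n_merges):
        # (a) count all candidate adjacent pairs
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # (b) stand-in criterion: just take the most frequent pair
        (a, b), _count = pairs.most_common(1)[0]
        # (c) replace every instance of that pair with one supertoken
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # eg "i"+"i" -> "ii", and later "ii"+"i" -> "iii"
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        # (d) real version: re-measure the distribution here, stop once it matches the target
    return tokens

print(merge_pairs("qokaiinshedyqokaiindaiin"))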

But this whole approach is pragmatic, not theoretic, and has a basic flaw - it's non-optimal. For example, in terms of EVA, you might (in fact, probably would) generate supertokens for each of <ii>, <<ii>i>, and <<ii><ii>>... but the best set of supertokens you're hoping to find might actually be <in>, <ir>, <i<in>>, <i<ir>> etc. It might work... but having to match a distribution you don't know in advance is always going to make this a bit of a long shot.

What I'm saying is: "optimality" in this kind of search is (unfortunately) a bit of a dirty word. :-(

Ultimately, I think using our eyes to generate a candidate set of supertokens (or sets of glyphs) and examining the resulting distribution curves is likely to get us closer to where we want to be. This isn't because we're smarter than computers - it's more that, as we don't yet know exactly what we're looking for, we probably won't recognise it until we see it. Trying to systematise that kind of search would be a bit tricky... :-o

Cheers, ....Nick Pelling....

PS: if, as your email seems to suggest, the target distribution you're trying to match (through supertokenising) is flat, then the resulting stream would look completely random. So, if you're expecting to find some kind of structured content, this won't be it at all: however, if you're looking for pure polyalpha ciphering (but with a supertoken back-end to hide the randomness), that would be the way to go. :-)
