Re: VMs: Declaration of WAR against EVA
Hi Petr,
At 07:07 06/03/03 +0100, Petr Kazil wrote:
On "Optimally segmenting the VMS into syllables/glyphs":
In my last mail I stated that there should be a direct algorithm to segment the VMS into groups of characters in such a way that the number of different groups is minimized and mean group-length is maximized.
I remembered that there exists such an algorithm. It's called "Huffman Coding". It was 20 years ago, so I've forgotten the details, but the algorithm selects the optimal code for a data stream so that all redundancy is removed.
In simple (context-free) Huffman coding (which I've been using for 20 years; data compression is one of my main professional skills!), you take a stream of data, calculate the probability of each particular input token occurring in the stream, and then assign a variable bit-length output token to each one accordingly.
This way, the most frequent input tokens get translated into the shortest
bit-length output tokens - and the rarest input tokens get translated into
the longest bit-length output tokens.
The magical thing here is that the size of the output file will tend towards the (context-free) entropy of the input file - ie, the compression ratio approaches the theoretical limit. Which is quite cool. :-)
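To make that concrete, here's a minimal Python sketch of context-free Huffman coding (my own throwaway code, not anything canonical - and the sample stream below is made up; you'd feed in real EVA tokens instead):

import heapq
from collections import Counter
from math import log2

def huffman_codes(stream):
    """Build context-free Huffman codes for a stream of tokens."""
    freq = Counter(stream)
    # Heap entries are (count, tie-breaker, tree); the tie-breaker stops
    # Python from trying to compare the trees themselves on a count tie.
    heap = [(n, i, tok) for i, (tok, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tie, (t1, t2)))
        tie += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):          # internal node: recurse
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: this token's bit pattern
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes, freq

stream = "qokeedyqokeedyqokaiin"             # made-up EVA-ish sample
codes, freq = huffman_codes(stream)
total = len(stream)
avg = sum(freq[t] * len(codes[t]) for t in codes) / total
H = -sum(n / total * log2(n / total) for n in freq.values())
print(codes)
print(f"avg {avg:.3f} bits/token vs entropy {H:.3f} bits/token")

The most frequent tokens get the shortest bit patterns, and the average code length lands just above the entropy figure.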
Put another way: if you have an input stream of 1000 tokens, and it happens to contain 250 instances of the letter <e>, then there's a 25% chance any letter you pick at random will be an <e>. The information content (entropy) of an <e> is therefore 25% = a 1 in 4 chance => 2 bits of information (a 1 in 8 chance => 3 bits, a 1 in 16 chance => 4 bits, etc... basically, powers of two).
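In code, that arithmetic is just this (trivial, but it shows the pattern):

from math import log2

# A token with probability p carries -log2(p) bits of information.
for p in (0.25, 0.125, 0.0625):
    print(f"p = {p:<7} -> {-log2(p):.0f} bits")   # prints 2, 3, 4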
What (I think) you're effectively asking is this: given that we know the probabilities of each EVA character occurring - and hence can calculate the information content (in bits) of each - can we use that knowledge to create an algorithm that somehow groups them into "super-tokens", whose entropic distribution better matches that of (say) European languages?
The answer is... sort of. I've created lots of bespoke data-compression algorithms on this kind of theme in the past, but where the target distribution to match is as flat as possible (ie with as little redundancy left as possible)... however, because grouping tokens into super-tokens changes the length of the file (ie after grouping tokens together, there are always fewer super-tokens in the file than there were tokens before), this changes all the stats along the way.
The algorithmic approach I normally take for these kinds of files is (there's a rough sketch of the loop below):
(a) calculate all likely candidate pairs that could be taken (ie <qo>, <dy>, <ii> etc);
(b) decide which one of these best satisfies the criteria I'm looking for;
(c) replace all instances of that group in the text with a "supertoken";
(d) if I haven't yet matched my target distribution, repeat all over again.
Note that as the process repeats, once you have a supertoken for (say)
<ii>, the likely candidate search might well produce <<ii>i> as a token.
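Here's a rough Python sketch of that (a)-(d) loop - for illustration I've used the crudest possible criterion in step (b), ie just take the most frequent adjacent pair, whereas in practice you'd plug in whatever distribution test you're actually chasing:

from collections import Counter

def merge_pairs(tokens, n_merges):
    """Greedily fuse the most frequent adjacent pair of tokens
    into a single supertoken, n_merges times over."""
    tokens = list(tokens)
    for _ in range(n_merges):
        # (a) count all adjacent candidate pairs
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # (b) pick the "best" one - here, simply the most frequent
        (a, b), _ = pairs.most_common(1)[0]
        # (c) replace every instance with a supertoken
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        # (d) a real run would stop here once the target distribution
        #     was matched, rather than looping a fixed number of times
    return tokens

print(merge_pairs("qokeedyqokaiin", 3))      # made-up EVA-ish sample

Because supertokens are just concatenated strings, a later pass can happily merge <ii> with <i> to give <iii> - which is exactly the <<ii>i> effect described above.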
But this whole approach is pragmatic, not theoretical, and has a basic flaw - it's non-optimal. For example, in terms of EVA, you might (in fact, probably would) generate supertokens for each of <ii>, <<ii>i>, and <<ii><ii>>... but the best set of supertokens you're hoping to find might actually be <in>, <ir>, <i<in>>, <i<ir>> etc. It might work... but having to match a distribution you don't know in advance is always going to make this a bit of a long shot.
What I'm saying is: "optimality" in this kind of search is (unfortunately)
a bit of a dirty word. :-(
Ultimately, I think using our eyes to generate a candidate set of supertokens (or sets of glyphs) and examining the resulting distribution curves is likely to get us closer to where we want to be. This isn't because we're smarter than computers - rather, that as we don't yet know exactly what we're looking for, we probably won't recognise it until we see it. Trying to systematise that kind of search would be a bit tricky... :-o
Cheers, ....Nick Pelling....
PS: if, as your email seems to suggest, the target distribution you're trying to match (through supertokenising) is flat, the stream would look completely random. So, if you're expecting to find some kind of structured content, this approach won't get you anywhere near it: however, if you're looking for pure polyalpha ciphering (but with a supertoken back-end to hide the randomness), that would be the way to go. :-)
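One quick way to sanity-check that flatness is to compare the observed token entropy against its theoretical maximum (log2 of the alphabet size) - a ratio of 1.0 means perfectly flat, ie random-looking. A throwaway sketch (sample strings made up):

from collections import Counter
from math import log2

def flatness(tokens):
    """Observed entropy divided by the maximum possible entropy
    (log2 of the alphabet size): 1.0 means perfectly flat."""
    freq = Counter(tokens)
    total = len(tokens)
    h = -sum(n / total * log2(n / total) for n in freq.values())
    return h / log2(len(freq)) if len(freq) > 1 else 0.0

print(flatness("aaaabbbbccccdddd"))   # 1.0   - flat, random-looking
print(flatness("aaaaaaaaaaaabbcd"))   # ~0.59 - visibly structured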