
Re: VMs: Huffman compression (was: VMs: Declaration of WAR against EVA)



Hi Jacques,

The key problems with using data-compression algorithms to analyse the VMS are:
(a) their output has flat (near-uniform) stats, whereas any underlying language would have peaky stats
(b) if they use pattern-matching (like LZ77), they find groups that are effective for compression, not semantic groups


For example, LZ77 (a commonly used pattern-matching front-end) would quickly start finding matches above the likely semantic level of usefulness: on the first line of f1r, it would probably match ".shol" backwards within the same line (i.e., including the "." as well as "shol").
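To make that concrete, here is a rough sketch of a greedy LZ77-style matcher. The function names, the minimum match length of 3 and the sample string are all my own assumptions: the sample is an invented EVA-style line, not a checked transcription of f1r, and real LZ77 implementations differ in window size and match policy.

```python
def longest_back_match(text, pos, window=4096, min_len=3):
    """Return (distance, length) of the longest earlier match for text[pos:], or None."""
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        # Extend the match while characters agree (overlap with pos is allowed).
        while pos + length < len(text) and text[cand + length] == text[pos + length]:
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return (best_dist, best_len) if best_len >= min_len else None


def lz77_tokens(text, min_len=3):
    """Greedy parse into ('lit', char) and ('copy', distance, matched_text) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = longest_back_match(text, pos, min_len=min_len)
        if match:
            dist, length = match
            tokens.append(("copy", dist, text[pos:pos + length]))
            pos += length
        else:
            tokens.append(("lit", text[pos]))
            pos += 1
    return tokens


# Invented EVA-style sample (hypothetical, for illustration only).
sample = "fachys.shol.ykal.sholdy"
tokens = lz77_tokens(sample)
for tok in tokens:
    if tok[0] == "copy":
        print(tok)
```

On this sample the only copy token covers ".shol", separator included: the matcher has no idea where a word starts, it just takes the longest run it has seen before.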

For LZH, Huffman is then used as a "back-end coder", so that the copy-command parameters get stored as compactly as possible (that's what the "LZ" and "H" in "LZH" stand for). :-)

Also: adaptive Huffman schemes achieve better performance (i.e., smaller compressed files) by updating the distributional stats (and hence the bit-lengths) during compression. While this is a nice feature, it too might well get in the way of understanding what's going on.
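As a rough illustration of how the bit-lengths follow the stats, the sketch below builds Huffman code lengths from symbol counts and then recomputes them on growing prefixes of a text. This is only an approximation of the idea: real adaptive Huffman coders update the tree incrementally rather than rebuilding it, and the function name and toy text are my own assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_lengths(freqs):
    """Map each symbol to its Huffman code length in bits."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Every symbol under the merged node sits one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Toy text (hypothetical): the stats, and so the code lengths, drift as it grows.
text = "ababababccccccccccccdddd"
for end in (4, 12, len(text)):
    print(end, huffman_lengths(Counter(text[:end])))
```

The same symbol can end up with different bit-lengths at different points in the stream, which is exactly what makes the coded output awkward to read anything into.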

It would be interesting to try out the language-comparison-via-compression-algorithm-performance approach on some grouped-glyph texts: I think they would show up as being closer to known languages than VMS text in EVA (or Currier) transcription... but probably still not *too* close.
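One possible way to set up such a comparison (an assumption on my part, not necessarily the measure used earlier in this thread) is to ask how many extra compressed bytes a sample costs when appended to a reference text. The sketch uses Python's zlib, whose DEFLATE format is LZ77 plus Huffman; the function names and the stand-in texts are hypothetical, and real runs would need proper corpora and a non-empty sample.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes at the highest zlib level."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: str, sample: str) -> float:
    """Extra compressed bytes per sample byte when the sample follows the reference."""
    ref = reference.encode("utf-8")
    smp = sample.encode("utf-8")
    return (csize(ref + smp) - csize(ref)) / len(smp)


# Stand-in texts (hypothetical): swap in real corpora and transcriptions.
ref_a = "the quick brown fox jumps over the lazy dog " * 20
ref_b = "lorem ipsum dolor sit amet consectetur " * 20
sample = "the quick brown fox jumps over the lazy dog "

print("vs ref_a:", cross_cost(ref_a, sample))
print("vs ref_b:", cross_cost(ref_b, sample))
```

A lower cost means the reference already "explains" more of the sample. Note that zlib only looks back 32 KB, so long reference texts would need a compressor with a bigger window.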

Worth trying, though. :-)

Cheers, .....Nick Pelling.....
