Re: VMs: Huffman compression (was: VMs: Declaration of WAR against EVA)
Hi Jacques,
The key problems with using data-compression algorithms to analyse the VMS are:
(a) their output has flat (near-uniform) stats, whereas any underlying
language would have peaky stats
(b) if they use pattern-matching (like LZ77), they find effective groups,
not semantic groups
For example, LZ77 (which is a commonly-used pattern-matching front-end)
would quickly start finding matches above the likely semantic level of
usefulness - for example, on the first line of f1r, it would probably match
".shol" backwards within the same line (ie, including "." as well as "shol").
For LZH, Huffman is then used as a "back-end coder", so that the parameters
of all the copy commands get stored as compactly as possible (that's what
the "LZ" and "H" in "LZH" stand for). :-)
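To make the LZ77 point concrete, here is a minimal Python sketch of the longest-backward-match search. The function name, the window size and the minimum match length are my own choices for illustration, and the sample string is only an EVA-style line typed in by hand, not a checked transcription. The (distance, length) pair it returns is the kind of copy-command parameter the Huffman back-end would then encode.

```python
def longest_back_match(text, pos, window=4096, min_len=3):
    """Return (distance, length) of the longest earlier match for text[pos:].

    Returns None if no match of at least min_len characters exists.
    Overlapping matches are allowed, as in LZ77.
    """
    best = None
    start = max(0, pos - window)
    for cand in range(start, pos):
        length = 0
        while (pos + length < len(text)
               and text[cand + length] == text[pos + length]):
            length += 1
        if length >= min_len and (best is None or length > best[1]):
            best = (pos - cand, length)
    return best


# Illustrative EVA-style line (an assumption, not a verified transcription).
sample = "fachys.ykal.ar.ataiin.shol.shory.cthres.y.kor.sholdy"
pos = sample.rfind(".shol")
match = longest_back_match(sample, pos)
if match is not None:
    distance, length = match
    print(distance, length, repr(sample[pos:pos + length]))
```

The match found this way includes the "." separator, which is the problem: the matcher is indifferent to where words start and stop.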
Also: Adaptive Huffman schemes achieve better performance (ie, smaller
compressed files) by adapting the distributional stats (and hence the
bit-lengths) during the compression. While this is a nice feature, it too
might well get in the way of understanding what's going on.
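As a rough illustration of bit-lengths shifting as the stats change, the sketch below rebuilds a static Huffman table from running counts at a few checkpoints. This is not a real adaptive Huffman algorithm (such as FGK or Vitter), just the simplest thing that shows the effect; the function names and the toy input are assumptions of mine.

```python
import heapq
from collections import Counter


def huffman_code_lengths(counts):
    """Map each symbol to its Huffman code length in bits."""
    if len(counts) == 1:
        return {sym: 1 for sym in counts}
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker stops Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0})
            for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def lengths_over_time(text, checkpoints):
    """Code lengths computed from the first n symbols, for each checkpoint n."""
    return [huffman_code_lengths(Counter(text[:n])) for n in checkpoints]


# Toy input: "a" starts out cheap, then gets more expensive once "c" dominates.
for table in lengths_over_time("abababcccccccc", [6, 14]):
    print(sorted(table.items()))
```

So the same glyph can cost a different number of bits at different points in the text, which is exactly what makes the output hard to read anything into.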
It would be interesting to try out the
language-comparison-via-compression-algorithm-performance approach on some
grouped-glyph texts: I think they would show up as being closer to known
languages than EVA (or Currier) transcription VMS text... but probably
still not *too* close.
Worth trying, though. :-)
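One way such a comparison might be set up is with the normalized compression distance (NCD), using zlib (itself LZ77 plus Huffman) as the compressor. NCD is my assumption here, not necessarily the measure used earlier in the thread, and the function names are hypothetical. Lower values mean the compressor finds more shared structure between the two texts.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes at zlib's highest compression level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_by_similarity(target: bytes, references: dict) -> list:
    """Sort reference texts (name -> bytes) from closest to farthest."""
    return sorted((ncd(target, text), name) for name, text in references.items())
```

The idea would be to feed it a grouped-glyph transcription and an EVA transcription as targets against the same set of known-language samples, and see which one ranks closer. The samples need to be of similar length, since zlib's small window and fixed overheads distort the numbers on short or very unequal texts.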
Cheers, .....Nick Pelling.....