
Re: identify a text's author or language

There is someone at an Italian university who claims to identify an author
and/or his language by using the ZIP algorithm.
1. Take any text greater than n bytes and compress it with ZIP - this is the "known" text.
2. Append more text - this is the "unknown" text - and compress the combined text too.
3. Compare the difference in length between the compressed texts from steps 1 and 2. If
this difference is a minimum, they claim, the "unknown" text is in the same language as
the "known" text, or even by the same author.
This procedure reminds me of the "entropy test", which was done on the VMS
years ago.
Any comments?

ZIP-era algorithms typically comprise two stages:
(1) a pattern-matching stage, which converts an input stream into an output stream of both copy(-offset, length) commands and uncompressed literals; and
(2) a statistical (or entropy) encoder (like a Huffman or arithmetic coder), which tries to compress the output of the first stage down to the entropy of that output stream.
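
As a rough illustration of stage (1) only, here is a minimal Python sketch of a greedy pattern matcher. The function names, the window size and the length limits are my own assumptions for illustration and are not taken from any real ZIP implementation; stage (2) is not shown.

```python
def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3, max_len: int = 258):
    """Greedy stage-1 sketch: returns ('lit', byte) or ('copy', offset, length) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the preceding window for the longest match.
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i (overlapping copy).
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Rebuild the input by replaying literals and copy commands."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])  # byte-by-byte so overlapping copies work
    return bytes(out)
```

A text that repeats itself turns into a few literals followed by long copy commands, which is what the entropy coder then gets to work on.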

Thus, the use of the ZIP algorithm in this "identify-the-author-and-his-language" algorithm you mention would carry out not only an entropy calculation, but also a pattern-matching calculation.
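
The three steps in the quoted message can be sketched in a few lines of Python. This is only my reading of the procedure, assuming zlib's DEFLATE as a stand-in for "ZIP"; the function names and the dictionary of reference texts are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_cost(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: how many more bytes the compressed output needs once
    the unknown text is appended to the known text."""
    return compressed_size(known + unknown) - compressed_size(known)


def best_match(references: dict, unknown: bytes) -> str:
    """Return the name of the reference text giving the minimum difference."""
    return min(references, key=lambda name: extra_cost(references[name], unknown))
```

Bear in mind that DEFLATE only looks back 32 KB, so with long "known" texts only their tail would influence the result in this sketch.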

This would seem very plausible, though using the current compression algorithm of choice - the BWT ("Burrows-Wheeler Transform") - would probably yield a more sensitive test than ZIP.

FYI: the BWT sorts an input file by context (either backward or forward), which produces a very coherent output - for example, if the preceding context was "for exampl", the chances are very high that the next letter (in all occurrences) would be "e". The output of this transform is then encoded using one of the many variants of the MTF ("Move To Front") algorithm, and then statistically encoded.
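
For the curious, a naive Python sketch of the transform and of the basic MTF step follows. It sorts all rotations of the input (i.e. by following context), which is far too slow for real files, and the end-of-string marker "\0" is an assumption of mine that must not occur in the text. The function names are illustrative only.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive BWT: sort every rotation of text+eos, keep the last column."""
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the rotation that ends with the marker.
    row = next(r for r in table if r.endswith(eos))
    return row[:-1]


def mtf(text: str) -> list:
    """Basic Move To Front: emit each symbol's current position in the
    list, then move that symbol to the front."""
    alphabet = sorted(set(text))
    out = []
    for ch in text:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out
```

Runs of identical letters in the BWT output become runs of zeros after MTF, which is what makes the final statistical stage effective.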

Cheers, .....Nick Pelling.....