Re: identify a text's author or language
There is someone at an Italian university who claims to identify an author
and/or his language by using the ZIP algorithm.
1. Take any text greater than n bytes and compress it with ZIP; this is the
"known" text.
2. Add more text and compress it too; this is the "unknown" text.
3. Compare the difference in compressed length between steps 1 and 2. If you
get a minimum difference, they claim, the "unknown" text is derived from
the "known" text's language or even from the same author.
This procedure reminds me of the "entropy test", which was done on the VMS
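The three-step procedure quoted above can be sketched in a few lines of
Python. This is only an illustration under assumptions: zlib's DEFLATE stands
in for "ZIP", the function names (compressed_size, extra_cost, best_match) are
made up for this sketch, and nothing here is confirmed to be the Italian
group's actual method.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (zlib, maximum level)."""
    return len(zlib.compress(data, 9))


def extra_cost(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: how many extra compressed bytes the unknown text costs
    when it is appended to the known text."""
    return compressed_size(known + unknown) - compressed_size(known)


def best_match(references: Mapping[str, bytes], unknown: bytes) -> str:
    """Return the name of the reference text for which the unknown text
    adds the smallest compressed-length difference."""
    return min(references, key=lambda name: extra_cost(references[name], unknown))
```

To try it, pass a mapping such as {"english": ..., "italian": ...} of reference
texts to best_match together with the text to classify. Note that DEFLATE only
looks back 32 KB, so only the tail of a long "known" text can influence the
result.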
ZIP-era algorithms typically comprise two stages:-
(1) a pattern-matching stage, which converts an input stream into an output
stream of both copy(-offset, length) commands and uncompressed literals; and
(2) a statistical (or entropy) encoder (like a Huffman or arithmetic
encoder), which tries to compress the output of the first stage down to the
entropy of that stage's output stream.
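As a rough illustration of the two stages, here is a toy sketch in Python.
Everything in it is an assumption made for clarity: the brute-force match
search, the window and length limits, and the names lz77_tokens, lz77_decode
and entropy_bits are not taken from any real ZIP implementation. Stage 2 is
not a real Huffman or arithmetic coder either; it only estimates the order-0
entropy that such a coder would aim for.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple, Union

# A token is either a literal byte value or an (offset, length) copy command.
Token = Union[int, Tuple[int, int]]


def lz77_tokens(data: bytes, window: int = 4096,
                min_len: int = 3, max_len: int = 258) -> List[Token]:
    """Stage 1: turn the input into literals and copy(-offset, length) commands."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens: Iterable[Token]) -> bytes:
    """Invert stage 1 by replaying literals and copy commands."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            offset, length = tok
            for _ in range(length):
                out.append(out[-offset])  # byte-wise, so copies may overlap
        else:
            out.append(tok)
    return bytes(out)


def entropy_bits(symbols: Iterable) -> float:
    """Stage 2 target: total order-0 entropy of a symbol stream, in bits."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())
```

Comparing entropy_bits(data) with entropy_bits(lz77_tokens(data)) on a
repetitive input gives a feel for how much the pattern-matching stage removes
before the statistical encoder ever sees the stream.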
Thus, the use of the ZIP algorithm in this
"identify-the-author-and-his-language" algorithm you mention would carry
out not only an entropy calculation, but also a pattern-matching calculation.
This would seem very plausible, though using the current compression
algorithm of choice, the BWT ("Burrows-Wheeler Transform"), would
probably yield a more sensitive test than ZIP.
FYI: the BWT sorts an input file by context (either backward or forward),
which produces a very coherent output - for example, if the preceding
context was "for exampl", the chances are very high that the next letter
(in all occurrences) would be "e". The output of this transform is then
encoded using one of the many variants of the MTF ("Move To Front")
algorithm, and then statistically encoded.
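The BWT and MTF steps described above can be sketched in Python as follows.
This is a naive version for illustration only: it sorts whole rotations, which
is far too slow for real files, and the function names (bwt, inverse_bwt,
mtf_encode) are assumptions of this sketch, not any particular compressor's
API. The final statistical encoding step is left out.

```python
from typing import List, Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Sort all rotations of data; return the last column and the row
    at which the original string appears."""
    if not data:
        return b"", 0
    n = len(data)
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in rotations)
    return last, rotations.index(0)


def inverse_bwt(last: bytes, index: int) -> bytes:
    """Rebuild the original data from the last column and the row index."""
    # A stable sort of the positions maps each first-column row to the
    # matching last-column row.
    order = sorted(range(len(last)), key=lambda i: last[i])
    out = bytearray()
    row = index
    for _ in range(len(last)):
        row = order[row]
        out.append(last[row])
    return bytes(out)


def mtf_encode(data: bytes) -> List[int]:
    """Move To Front: emit each byte's current rank, then move it to rank 0."""
    alphabet = list(range(256))
    ranks = []
    for b in data:
        r = alphabet.index(b)
        ranks.append(r)
        alphabet.insert(0, alphabet.pop(r))
    return ranks
```

Because the transform groups bytes that share a context, runs of identical
bytes appear in the BWT output, and MTF turns those runs into runs of zeros,
which the statistical encoder can then code very cheaply.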
Cheers, .....Nick Pelling.....