Re: identify a text's author or language

To: voynich@xxxxxxxx
Subject: Re: identify a text's author or language
From: Nick Pelling <incoming@xxxxxxxxxxxxxxxxx>
Date: Tue, 29 Jan 2002 11:27:57 +0000
In-reply-to: <3143B46F3796D51190390002A5518A18233395@acnt45.ac1.dsh.de>

There is someone at an Italian university who claims to identify an author
and/or his language by using the ZIP algorithm.
1. Take any text greater than n bytes and compress it with ZIP: this is the "known" text.
2. Add more text and compress it too: the added part is the "unknown" text.
3. Compare the lengths of the compressed texts from steps 1 and 2. If the
difference is minimal, they claim, the "unknown" text is derived from
the "known" text's language or even from the same author.
This procedure reminds me of the "entropy test", which was done on the VMS
years ago.
Any comments?

ZIP-era algorithms typically comprise two stages: (1) a pattern-matching stage, which converts an input stream into an output stream of both copy(-offset, length) commands and uncompressed literals; and (2) a statistical (or entropy) encoder (such as a Huffman or arithmetic coder), which tries to compress the output of the first stage down to the entropy of that stage's output stream.
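To make stage (1) concrete, here is a minimal Python sketch of a greedy pattern matcher. It is a toy, not ZIP's actual parser: the function names, the 4096-byte window and the minimum match length of 3 are all assumptions chosen for illustration, and stage (2), the entropy coder, is left out entirely.

```python
def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3):
    """Toy stage-1 parser: emit ('lit', byte) or ('copy', offset, length)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i (overlapping copy).
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_expand(tokens) -> bytes:
    """Undo lz77_tokens by replaying literals and copy commands."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):
                out.append(out[-offset])  # byte-by-byte so overlaps work
    return bytes(out)
```

For example, lz77_tokens(b"abcabcabcabc") gives three literals followed by a single copy command with offset 3 and length 9, which is exactly the kind of mixed stream the entropy coder then has to squeeze.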

Thus, using the ZIP algorithm in the "identify-the-author-and-his-language" procedure you mention would carry out not only an entropy calculation, but also a pattern-matching calculation.
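A rough sketch of the three-step procedure, in Python, might look like the following. It uses the standard zlib module (DEFLATE, the same method ZIP uses by default) rather than an actual ZIP archive, and the helper names extra_cost and closest_reference are made up for this example; whether the resulting numbers really say anything about authorship is the claim under discussion, not something the code establishes.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum effort)."""
    return len(zlib.compress(data, 9))


def extra_cost(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: how many extra compressed bytes the unknown text costs
    when it is appended to the known text."""
    return compressed_size(known + unknown) - compressed_size(known)


def closest_reference(references: dict, unknown: bytes) -> str:
    """Name of the reference text for which the unknown text is cheapest."""
    return min(references, key=lambda name: extra_cost(references[name], unknown))
```

One caveat with this sketch: DEFLATE only looks back 32 KB, so with long reference texts only the last 32 KB of the "known" text can actually be matched against the "unknown" text.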

This would seem very plausible, though using the current compression algorithm of choice - the BWT ("Burrows-Wheeler Transform") - would probably yield a more sensitive test than ZIP.

FYI: the BWT sorts an input file by context (either backward or forward), which produces a very coherent output - for example, if the preceding context is "for exampl", the chances are very high that the next letter (in all occurrences) will be "e". The output of this transform is then encoded using one of the many variants of the MTF ("Move To Front") algorithm, and then statistically encoded.
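As a small illustration of the first two steps, here is a naive Python sketch of the transform followed by a plain move-to-front pass. It sorts whole rotations, which is far too slow for real files, and the "\0" end marker and the function names are assumptions of this example rather than part of any particular BWT compressor; the final statistical coding step is omitted.

```python
def bwt(text: str, end: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of text + end
    marker and return the last column. The marker must not occur in text."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def mtf(text: str) -> list:
    """Plain move-to-front: replace each symbol by its current position in
    a list that starts as the sorted alphabet of the text."""
    alphabet = sorted(set(text))
    out = []
    for ch in text:
        idx = alphabet.index(ch)
        out.append(idx)
        # Recently seen symbols move to the front, so runs become zeros.
        alphabet.insert(0, alphabet.pop(idx))
    return out
```

On "banana" the transform groups the letters into "annb\0aa", and move-to-front turns that into [1, 3, 0, 3, 3, 3, 0]; on longer, more repetitive text the output is dominated by small numbers, which is what makes the final statistical stage effective.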

Cheers, .....Nick Pelling.....

References:
- identify a text's author or language
  - From: Anders, Claus
