
Re: identify a text's author or language

29/01/02 11:50:17, "Anders, Claus" <Claus.Anders@xxxxxxxxxxxxx> wrote:

>1. take any text greater than n Bytes, compress it with ZIP "known text"
>2. Add more text and compress it too - this is the "unknown" text
>3. compare difference of length of compressed text in step 1 and 2 . If you
>yield a minimum difference, they claim, the "unknown" text is derived from
>the "known" text's language or even from the same author.

I would say "congruent with" or "drawn from the same corpus", rather
than "derived from". But this is nit-picking.

The question: how small is "minimum"?
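To make the quoted procedure concrete, here is a minimal sketch in Python. It assumes zlib's DEFLATE as a stand-in for "ZIP", and the function names (compressed_size, added_cost) are made up for illustration. It does not answer the question of an absolute threshold; it only shows that the figure can be used relatively, by comparing several candidate "known" texts and taking the one with the smallest added cost.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def added_cost(known: bytes, unknown: bytes) -> int:
    """Steps 1-3 of the quoted procedure: how many extra compressed bytes
    the unknown text costs when appended to the known text."""
    return compressed_size(known + unknown) - compressed_size(known)


def closest_known(candidates: dict[str, bytes], unknown: bytes) -> str:
    """Return the label of the known text with the smallest added cost.
    'Minimum' is relative to the other candidates, not an absolute figure."""
    return min(candidates, key=lambda label: added_cost(candidates[label], unknown))
```

One caveat with this stand-in: DEFLATE only looks back over a 32 KB window, so with a long "known" text only its tail can influence the result.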

I would also say that producing the zipped files is unnecessary, and,
in fact, amounts to throwing out a great deal of information, since
you end up with a single figure. It would be far more informative
to compare the two Huffman trees that the algorithm builds along
the way.
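As a rough illustration of what comparing the trees could look like, the sketch below builds a plain byte-level Huffman code for each text and reports, symbol by symbol, how the code lengths differ. This is an assumption-laden simplification: a real ZIP implementation builds its trees over the output of its string-matching stage, not over raw bytes, and the function names here are hypothetical.

```python
from __future__ import annotations

import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Map each byte value in data to its Huffman code length in bits."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def code_length_differences(known: bytes, unknown: bytes) -> dict[int, int]:
    """Per-symbol difference in code length (known minus unknown),
    restricted to symbols that occur in both texts."""
    a = huffman_code_lengths(known)
    b = huffman_code_lengths(unknown)
    return {s: a[s] - b[s] for s in a.keys() & b.keys()}
```

The result is a profile over the whole alphabet rather than a single number, which is the extra information the zipped-length comparison throws away.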

(All this is off the top of my head, before I forget it)