Re: identify a text's author or language
29/01/02 11:50:17, "Anders, Claus" <Claus.Anders@xxxxxxxxxxxxx> wrote:
>1. take any text greater than n Bytes, compress it with ZIP "known text"
>2. Add more text and compress it too - this is the "unknown" text
>3. compare difference of length of compressed text in step 1 and 2 . If you
>yield a minimum difference, they claim, the "unknown" text is derived from
>the "known" text's language or even from the same author.
I would say "congruent with" or "drawn from the same corpus", rather
than "derived from". But this is nit-picking.
The question: how small is "minimum"?
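
For concreteness, here is a minimal sketch of the three quoted steps, assuming Python's zlib module (DEFLATE, the same method ZIP normally uses) stands in for the ZIP tool. The function names are made up for illustration. It also assumes one possible reading of "minimum": not an absolute threshold, but the smallest difference among several candidate "known" texts.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_bytes(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: how much the compressed size grows when `unknown` is appended."""
    return compressed_size(known + unknown) - compressed_size(known)


def closest_corpus(corpora: Mapping[str, bytes], unknown: bytes) -> str:
    """Name of the known text for which appending `unknown` costs the fewest bytes.

    Assumption: "minimum" is taken relative to the other candidates.
    """
    return min(corpora, key=lambda name: extra_bytes(corpora[name], unknown))
```

Note that DEFLATE only looks back 32 KB, so with a long "unknown" text only the tail of the "known" text has any influence on the result.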
I would also say that producing the zipped files is unnecessary, and,
in fact, amounts to throwing out a great deal of information, since
you end up with a single figure. It would be far more informative
to compare the two Huffman trees computed in the first stage of
(All this is off the top of my head, before I forget it)
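
A rough sketch of what comparing Huffman codes might look like, under a simplifying assumption: the code is built from plain byte frequencies, which is not what a ZIP compressor builds internally (DEFLATE applies Huffman coding to the output of its string-matching pass). The function names and the missing_bits penalty are hypothetical. The idea is to build the code from one text and measure how many bits per byte it would spend on the other.

```python
import heapq
from collections import Counter
from typing import Dict


def huffman_code_lengths(data: bytes) -> Dict[int, int]:
    """Map each byte value in `data` to the length in bits of its Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths


def cross_cost(model: bytes, text: bytes, missing_bits: int = 16) -> float:
    """Average bits per byte needed to encode `text` with the code built from `model`.

    Bytes that never occur in `model` are charged `missing_bits` (an arbitrary penalty).
    """
    if not text:
        raise ValueError("text must not be empty")
    lengths = huffman_code_lengths(model)
    return sum(lengths.get(b, missing_bits) for b in text) / len(text)
```

The dictionary returned by huffman_code_lengths can also be compared symbol by symbol between two texts, which keeps the per-character detail that a single compressed length throws away.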