Re: identify a text's author or language

At 11:01 PM 1/29/02 +0000, Jacques Guy wrote:
29/01/02 11:50:17, "Anders, Claus" <Claus.Anders@xxxxxxxxxxxxx> wrote:

>1. take any text greater than n Bytes, compress it with ZIP "known text"
>2. Add more text and compress it too - this is the "unknown" text
>3. compare difference of length of compressed text in step 1 and 2 . If you
>yield a minimum difference, they claim, the "unknown" text is derived from
>the "known" text's language or even from the same author.

I would say "congruent with" or "drawn from the same corpus", rather
than "derived from". But this is nit-picking.

I'd agree. It would also be useless for something that passed through physical or oral
transmission before being written down, or that is based on something secret or esoteric.
E.g.: Carlos Castaneda, L. Ron Hubbard

The question: how small is "minimum"?
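
For reference, the three quoted steps can be sketched roughly as below. This is only an illustration under assumptions: it uses Python's zlib (the DEFLATE method that ZIP commonly uses) instead of producing real ZIP files, and the names compressed_size, compression_delta and closest_corpus are made up for this sketch. Note that it sidesteps the "how small" question by only ranking candidates against each other, not by testing against an absolute threshold.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def compression_delta(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: extra compressed bytes needed once `unknown` is appended."""
    return compressed_size(known + unknown) - compressed_size(known)


def closest_corpus(corpora: Mapping[str, bytes], unknown: bytes) -> str:
    """Name of the known text whose delta for `unknown` is smallest."""
    return min(corpora, key=lambda name: compression_delta(corpora[name], unknown))
```

One practical caveat: DEFLATE only looks back 32 KB, so with a long "known" text only its tail can influence the delta.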

I would also say that producing the zipped files is unnecessary, and,
in fact, amounts to throwing out a great deal of information, since
you end up with a single figure. It would be far more informative
to compare the two Huffman trees computed in the first stage of
the algorithm.
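
A rough sketch of what comparing the two trees could look like, symbol by symbol, follows. It rests on a simplifying assumption: the Huffman code here is built directly from raw byte frequencies, whereas DEFLATE as used by ZIP builds its Huffman codes over the output of its LZ77 matching stage. The function names huffman_code_lengths and compare_codes are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, Optional, Tuple


def huffman_code_lengths(data: bytes) -> Dict[int, int]:
    """Map each byte value in `data` to its Huffman code length in bits."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def compare_codes(a: bytes, b: bytes) -> Dict[int, Tuple[Optional[int], Optional[int]]]:
    """Per-symbol code lengths in both texts; None marks an absent symbol."""
    la, lb = huffman_code_lengths(a), huffman_code_lengths(b)
    return {sym: (la.get(sym), lb.get(sym)) for sym in sorted(set(la) | set(lb))}
```

The result is a table with one row per symbol and not a single figure, so one can see which symbols are coded differently in the two texts.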

(All this is off the top of my head, before I forget it)

I like the sentence structure analysers. The shareware ones are adequate, but
I'd like to have the industrial-grade ones used by the three-letter agencies. That's
a more in-depth analysis, because it can find sentence deviations, which can indicate
cut-and-paste passages or emotional content. There is a nice balance between cryptological
and psychological analysis, not just informational analysis.