Re: Duplicate word search

On Sat, 10 Mar 2001, Claus Anders wrote:
> To compute the hash value for each word difference, I used the formula:
> d=sum(sqrt(c1n*c1n-c2n*c2n)) with c1n the nth character of word 1 and c2n
> the same for word 2. if d was lower than a certain e , the word with the
> higher frequency was chosen.

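I think you might get better results with
sqrt(sum((c1n-c2n)*(c1n-c2n))), which is the ordinary Euclidean distance
between the two words taken as vectors of character codes.  That's a much
more conventional "distance" measurement, and has the advantage that you
don't need to do any "choosing" of which word should be c1 and which
should be c2: every squared difference is nonnegative, so the argument of
sqrt() is guaranteed to be nonnegative, and the result is the same
whichever way round you take the two words.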
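
For illustration, here is a rough Python sketch of that distance.  The
function name char_distance is made up for this example, and padding the
shorter word with zero-valued characters is purely my assumption; the
formula itself doesn't say what to do when the words differ in length.

```python
import math
from itertools import zip_longest


def char_distance(word1, word2):
    """Euclidean distance between two words, compared character by character.

    Assumption: if the words differ in length, the shorter one is padded
    with NUL characters (code 0). This is only one possible convention.
    """
    total = 0
    for a, b in zip_longest(word1, word2, fillvalue="\0"):
        diff = ord(a) - ord(b)
        total += diff * diff  # a squared difference is never negative
    return math.sqrt(total)


if __name__ == "__main__":
    # The order of the arguments does not matter.
    print(char_distance("color", "colour"))
    print(char_distance("colour", "color"))
```

By contrast, with c1n*c1n-c2n*c2n inside the sqrt(), any position where
c1n < c2n gives a negative argument, which is why the words would have to
be ordered first.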

Matthew Skala
mskala@xxxxxxxxxxxxxxxxx                   :CVECAT DELENDA EST