Duplicate word search
Jim Comegys wrote:
>If one of you is computer capable and has a bit of free time, I am doing a
>comparison of duplicates and near duplicates in the VMS, sequences like EVA
>ytchal ytchal and cphor ytchor and the like. It is slow and dull to search
>these things out visually, does anyone have a program to seek out these?
>better yet a list because maybe you have studied the matter?
>While we are on the matter, where is a good list of every Voynich word and
>its frequency? I can no longer access the old Mik Clarke site.
>Thank you very much, and have a good week-end.
>Jim Comegys, Madera, California
I have something similar: I wrote a little awk script (yes, again) which
assigned a number to each word (like a hash algorithm) and computed, for each
'word', the 'distance' to the others. The result was that a few (~100) words
could easily be misspellings of others (with higher frequency). The only
problem was defining the hash code, which I did visually, i.e. 'characters'
with a similar look were coded with close numbers, and very different ones got
a large number difference.
To compute the distance between two words, I used the formula:
d=sum(sqrt(c1n*c1n-c2n*c2n)), with c1n the code of the nth character of word 1
and c2n the same for word 2. If d was lower than a certain threshold e, the
word with the higher frequency was chosen.
So, most of the * characters in the transcription could be eliminated.
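
The procedure described above can be sketched roughly as follows. This is
Python rather than the original awk script, and it is only an illustration
under several assumptions: the CODES table is invented for the example (it is
not the visual coding actually used), the per-character term is the absolute
difference |c1n - c2n| rather than the formula quoted above (which is undefined
whenever c2n is larger than c1n), a * is treated as matching any character,
and only words of equal length are compared. The names distance and
suggest_corrections are hypothetical.

```python
from collections import Counter

# Hypothetical visual-similarity codes for some EVA letters: glyphs that look
# alike get close numbers. These values are illustrative assumptions only.
CODES = {
    "o": 1, "a": 2, "y": 3,
    "c": 10, "e": 11, "h": 12,
    "t": 20, "k": 21, "p": 22, "f": 23,
    "l": 30, "r": 31,
    "d": 40, "s": 41,
    "i": 50, "n": 51,
}


def distance(w1, w2, codes=CODES):
    """Sum of per-position code differences, or None if lengths differ."""
    if len(w1) != len(w2):
        return None
    total = 0
    for a, b in zip(w1, w2):
        if a == "*" or b == "*":
            continue  # assumption: an unreadable glyph matches anything
        total += abs(codes[a] - codes[b])
    return total


def suggest_corrections(words, e=2):
    """Map each word to the most frequent more-common word with distance < e."""
    freq = Counter(words)
    fixes = {}
    for w in freq:
        best = None
        for v in freq:
            # Only a strictly more frequent word can replace w.
            if v == w or freq[v] <= freq[w]:
                continue
            d = distance(w, v)
            if d is not None and d < e:
                if best is None or freq[v] > freq[best]:
                    best = v
        if best is not None:
            fixes[w] = best
    return fixes
```

Calling suggest_corrections on the full word list of a transcription gives a
dictionary from suspect spellings to their more frequent neighbours. The
threshold e and the code table both need tuning by eye; a table that is too
coarse will merge words that are genuinely different.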