Repeated words
> [Jim Comegys:] I am doing a comparison of duplicates and near
> duplicates in the VMS ... does anyone have a program to seek out
> these? Or better yet a list because maybe you have studied the
> matter?
What programming languages do you have available on your system (C,
Pascal, Java, other)? I do most of my Voynich-related programming in
AWK, a language that comes with the Unix (or Linux) operating system and
is rather handy for text processing. I don't have what you are
asking for, but, if you can use it, I have plenty of word-counting and
"fuzzy" word-comparison code that may help.
> While we are on the matter, where is a good list of every
> Voynich word and its frequency? I can no longer access the old
> Mik Clarke site.
Look in
http://www.ic.unicamp.br/~stolfi/voynich/Notes/100/lang/
The files you want are:
  voyt/tot.n/raw.wds = the whole text, one word per line
  voyt/tot.n/gud.wfr = counts and frequencies for the "good" words
  voyl/tot.n/*       = ditto, labels only
  voyn/tot.n/*       = ditto, non-label text only
The text is a majority-vote combination of several transcriptions.
The "good" words exclude words containing unreadable or contentious
characters, or any "weird" characters (basically those that occur
fewer than 30 times in the whole book).
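
The "good words" rule could be approximated along these lines. This is a
sketch in Python, not the procedure actually used to build the gud.wfr
files. Only the threshold of 30 comes from the description above; the
marker characters "*" and "?" for unreadable or contentious glyphs, the
choice to count characters over the same word list, and the function name
are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def good_words(words: Iterable[str],
               bad_chars: str = "*?",        # assumed markers, see above
               min_char_count: int = 30) -> List[str]:
    """Keep only words that contain neither a marker character nor a
    'weird' character, i.e. one occurring fewer than min_char_count
    times in the whole word list."""
    words = list(words)
    char_counts = Counter(ch for w in words for ch in w)
    rare = {ch for ch, n in char_counts.items() if n < min_char_count}
    excluded = rare | set(bad_chars)
    return [w for w in words if not (set(w) & excluded)]


if __name__ == "__main__":
    # Invented data: "x" occurs once, so "abx" is dropped as weird;
    # "a*b" is dropped because it contains an assumed marker.
    sample = ["ab"] * 30 + ["abx", "a*b"]
    print(len(good_words(sample)))
```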
There are also similar files per section, e.g.
  voyt/bio.1/gud.wfr = counts and frequencies for the biological section
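
If you would rather recompute counts yourself from a one-word-per-line
file such as raw.wds, something like the following Python sketch would
do. The function name and the output layout are assumptions; it does not
claim to reproduce the exact format of the .wfr files.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def word_frequencies(lines: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Given one word per line, return (word, count, relative frequency)
    sorted by decreasing count, ties broken alphabetically.
    Blank lines are ignored."""
    words = [ln.strip() for ln in lines if ln.strip()]
    total = len(words)
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, n, n / total) for w, n in ordered]


if __name__ == "__main__":
    # Invented input standing in for the lines of a local word file.
    for word, n, freq in word_frequencies(["daiin\n", "chol\n", "daiin\n"]):
        print(f"{n:6d} {freq:8.5f} {word}")
```

With a local copy of the file, `word_frequencies(open("raw.wds"))` would
give the table for the whole text.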
Hope it helps. All the best,
--stolfi