Re: WG: average word length in VMS
> [Claus Anders] to compute the "real" word length (as opposed to
> token length), I wrote a small awk script to compile all char.
> combinations of the VMS, ignoring the token break char. [...]
> PS: words in this context are clusters of characters within one
> VMS line, ignoring token/line/par breaks.
I don't quite understand what you mean. Are your words the same thing
as our words (i.e. strings of non-break characters delimited by
breaks)? Or do you discard the break characters and then take all
possible substrings of each line (i.e. from column i to column j, for
all pairs 1 <= i < j <= n)?
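To make the two readings concrete, here is a minimal Python sketch of both. The function names, the sample line, and the choice of "." as the break character are assumptions made for illustration only; they are not taken from Claus's awk script.

```python
def break_delimited_words(line, breaks="."):
    """Reading 1: words are maximal runs of non-break characters."""
    for b in breaks:
        line = line.replace(b, " ")
    return line.split()


def all_substrings(line, breaks="."):
    """Reading 2: drop the breaks, then take every substring from
    column i to column j for all pairs 1 <= i < j <= n."""
    s = "".join(ch for ch in line if ch not in breaks and not ch.isspace())
    n = len(s)
    # 0-based: i runs over 0..n-2, j over i+1..n-1, both ends inclusive
    return [s[i:j + 1] for i in range(n) for j in range(i + 1, n)]


sample = "daiin.chol"  # hypothetical line, "." assumed to be the break
words = break_delimited_words(sample)
subs = all_substrings(sample)
```

Under the second reading a line of n symbols yields n(n-1)/2 substrings, many of which straddle the original breaks, so the two definitions give very different word lists and very different length statistics.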
> The result:
> # of different words with frq. > 4 : 20603
> average word length of these words : 6.88861
Does the latter take into account the frequency of each word? Or is it
the mean length of the *set* of distinct words, without regard to
their frequencies?
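The two averages can differ a lot, as the following small Python sketch shows. The function name and the toy counts are invented for illustration; only the "frequency > 4" cutoff comes from the quoted message.

```python
from collections import Counter


def mean_lengths(counts, min_freq=5):
    """Return (type_mean, token_mean) over words with count >= min_freq.

    type_mean  : each distinct word counts once
    token_mean : each word is weighted by its frequency
    """
    kept = {w: c for w, c in counts.items() if c >= min_freq}
    if not kept:
        raise ValueError("no word reaches the frequency cutoff")
    type_mean = sum(len(w) for w in kept) / len(kept)
    token_mean = sum(len(w) * c for w, c in kept.items()) / sum(kept.values())
    return type_mean, token_mean


# Toy data, not real VMS counts
counts = Counter({"daiin": 10, "ol": 30, "qokeedy": 5, "x": 1})
type_mean, token_mean = mean_lengths(counts)
```

Because short words tend to be the frequent ones, the frequency-weighted mean is usually the smaller of the two, so it matters which one a reported figure refers to.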
> The output was reduced to "words" with frequency > 4, and only up
> to folio 101, because my computer became quite slow owing to the
> memory needed (more than 600 MB). The processing time for 1 line
> increased to 1 hour and grew exponentially with every further
> line.
What computing system do you use (Windows, Unix)? If it is Unix,
I may be able to help you to make your program faster.
> Any hints for extracting a root-word out of this (like Jorge's
> mantle/crust/core)?
One method is to build a finite automaton that recognizes the set of
those words, and then look for "important" states that are used by
many words. (In fact, my debut as a Voynichologist was applying this
technique to the list of standard (space-delimited) VMs words. It did
reveal some root/suffix "declensions", but unfortunately they had
all been noticed before.)
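As a rough illustration of the idea, the Python sketch below uses a prefix tree, which is the simplest (unminimized) automaton that recognizes a word list: each distinct prefix is one state, and a state is "important" when many distinct words pass through it. This is a simplification chosen for brevity and not necessarily the construction used in the work mentioned above; the function names, thresholds, and sample words are all assumptions.

```python
from collections import Counter


def state_usage(words):
    """Map each trie state (identified by its prefix) to the number of
    distinct words whose path passes through it."""
    usage = Counter()
    for w in set(words):
        for k in range(1, len(w) + 1):
            usage[w[:k]] += 1
    return usage


def busy_states(words, min_words=2, min_len=2):
    """States shared by at least min_words words, busiest first."""
    usage = state_usage(words)
    hits = [(p, n) for p, n in usage.items()
            if n >= min_words and len(p) >= min_len]
    return sorted(hits, key=lambda t: (-t[1], t[0]))


# Toy word list for illustration only
words = ["chedy", "chey", "chol", "shedy", "shey"]
shared_prefixes = busy_states(words)
# Running the same code on reversed words exposes shared endings
shared_endings = busy_states([w[::-1] for w in words], min_len=1)
```

A minimized automaton would additionally merge states with identical continuations, so that common suffixes show up as shared states directly instead of requiring the reversed pass.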
As for the core/mantle/crust stuff, I didn't use any special method.
From poring over the digraph and trigraph frequencies I eventually got
convinced that the <e>s were modifying suffixes for a certain subset
of the symbols (gallows and tables), and the [aoy] seemed to be
inserted at specific positions within the word. So I deleted the
[aoy], regrouped the remaining symbols into X/Xe "letters", and again
looked at their digraph and trigraph frequencies. At that level, the
three-layer structure wasn't hard to see.
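The steps just described (delete [aoy], regroup into X/Xe units, recount digraphs) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the default set of base symbols that may absorb a following <e>, the decision to attach a whole run of <e>s, and the function names are all guesses made for the example, not the exact rules used originally.

```python
import re
from collections import Counter


def regroup(word, bases=("ch", "sh", "k", "t", "p", "f")):
    """Delete a/o/y, then split into units: a base symbol plus any
    directly following e's, or else a single character.
    The default 'bases' tuple is an assumption for illustration."""
    stripped = re.sub(r"[aoy]", "", word)
    # Longest bases first so "ch" wins over a shorter alternative
    alt = "|".join(sorted(map(re.escape, bases), key=len, reverse=True))
    return re.findall(rf"(?:{alt})e*|.", stripped)


def unit_digraphs(words):
    """Count adjacent pairs of regrouped units over a word list."""
    counts = Counter()
    for w in words:
        units = regroup(w)
        counts.update(zip(units, units[1:]))
    return counts


units = regroup("qokeedy")
pairs = unit_digraphs(["chedy", "chedy", "shedy"])
```

Tabulating the resulting pairs by which unit precedes which is one way to look for a layered ordering among the unit classes.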
All the best,
--stolfi