
WG: average word length in VMS



> Dear all,
> To compute the "real" word length (as opposed to token length), I wrote a
> small awk script that compiles all character combinations of the VMS,
> ignoring the token break character. The output was reduced to "words" with
> frequency > 4, and only up to folio 101, because my computer became quite
> slow due to the memory needed (more than 600 MB). The processing time for
> one line increased to 1 hour and grew exponentially with every further line.
> The result:
> Number of different words with frequency > 4 : 20603
> Average word length of these words           : 6.88861
> Most of the words differ only in their endings (maybe declensions or
> conjugations).
> It seems these numbers are more similar to known languages than the number
> computed for tokens.
> The next step will be to look at the roots of all these words to produce a
> kind of vocabulary.
> Any hints for extracting a root word out of this (like Jorge's
> mantle/crust/core)?
> Claus
> PS: Words in this context are clusters of characters within one VMS line,
> ignoring token/line/paragraph breaks.
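
For readers who want to try the counting step described above, here is a minimal sketch in Python rather than awk. It is not Claus's script, only one possible reading of it. Assumptions: the transcription marks token breaks with "." or "," (the break_chars default), a "word" is any contiguous cluster of characters inside one line, every occurrence is counted, and the average is taken over distinct words rather than weighted by frequency. The function names and the max_len parameter are hypothetical.

```python
from collections import Counter


def count_clusters(lines, break_chars=".,", max_len=None):
    """Count every contiguous character cluster within each line.

    Token break characters are removed first, so clusters may span
    what the transcription treats as separate tokens.
    """
    counts = Counter()
    for line in lines:
        # Drop surrounding whitespace and the assumed token break characters.
        text = "".join(ch for ch in line.strip() if ch not in break_chars)
        n = len(text)
        for start in range(n):
            # Optionally cap the cluster length to limit memory use.
            stop = n if max_len is None else min(n, start + max_len)
            for end in range(start + 1, stop + 1):
                counts[text[start:end]] += 1
    return counts


def frequent_words(counts, min_freq=5):
    """Keep clusters seen at least min_freq times (frequency > 4 by default)."""
    return {word: c for word, c in counts.items() if c >= min_freq}


def average_length(words):
    """Mean length over distinct words; 0.0 for an empty collection."""
    words = list(words)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

A call such as average_length(frequent_words(count_clusters(lines))) would then give a figure comparable in kind to the one quoted above. Since a line of n characters has n(n+1)/2 clusters, passing a max_len cap is one way to keep the table of counts from growing too large.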
