> Dear all,
>
> to compute the "real" word length (as opposed to token length), I wrote a
> small awk script that compiles all character combinations of the VMS,
> ignoring the token break character. The output was reduced to "words"
> with frequency > 4, and it covers only up to folio 101, because my
> computer became quite slow from the memory needed (more than 600 MB).
> The processing time for a single line rose to 1 hour and grew
> exponentially with every further line.
>
> The result:
>
> # of different words with frequency > 4 : 20603
> average word length of these words      : 6.88861
>
> Most of the words differ only in their endings (maybe declensions or
> conjugations).
>
> It seems these numbers are more similar to known languages than the
> number computed for tokens:
>
> The next step will be to look at the roots of all these words to produce
> a kind of vocabulary.
>
> Any hints for extracting a root word out of this (like Jorge's
> mantle/crust/core)?
>
> Claus
>
> PS: "words" in this context are clusters of characters within one VMS
> line, ignoring token/line/paragraph breaks.
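
For readers who want to try the cluster counting described in the quoted message, here is a minimal sketch in Python rather than awk. It is not Claus's script. The function names, the choice of "." and "," as token break characters, and the optional `max_len` cap are all assumptions made for illustration. The average is taken over distinct clusters, which is one reading of "average word length of these words".

```python
from collections import Counter


def cluster_counts(lines, break_chars=".,", min_len=1, max_len=None):
    """Count every contiguous character cluster within each line.

    Token breaks (break_chars and whitespace, both assumed here) are
    removed first, so clusters may span token boundaries. Clusters never
    cross from one line into the next.
    """
    counts = Counter()
    for line in lines:
        text = "".join(
            ch for ch in line if ch not in break_chars and not ch.isspace()
        )
        n = len(text)
        for start in range(n):
            # Longest cluster that can begin at this position.
            longest = n - start if max_len is None else min(max_len, n - start)
            for length in range(min_len, longest + 1):
                counts[text[start:start + length]] += 1
    return counts


def frequent_clusters(counts, threshold=4):
    """Keep only clusters that occur more than `threshold` times."""
    return {word: c for word, c in counts.items() if c > threshold}


def average_length(words):
    """Mean length over the distinct clusters given."""
    words = list(words)
    return sum(len(w) for w in words) / len(words) if words else 0.0


if __name__ == "__main__":
    # Toy input, not real transcription data.
    sample = ["ab.ab", "ab ab", "abab"]
    frequent = frequent_clusters(cluster_counts(sample), threshold=4)
    print(sorted(frequent), average_length(frequent))
```

A line of n characters yields n(n+1)/2 clusters, so the table of distinct clusters grows quickly with long lines. Passing a `max_len` bound is one way to keep memory use in check, at the cost of never seeing clusters longer than that bound.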