VMs: Grove words (was: A very important discovery)
> [Jose Rodriguez:] ... another program,which determines the
> percentage of different words respect to the total of the words
> that have been appeared during an interval of lines that is
> indicated by the user.
As in many VMS studies, let's say that a "word" is an abstract
sequence of letters, and a "token" is an occurrence of a word in a
text. So, for example in the sentence "the man can open the can"
there are six tokens but only four words.
So I presume that your graphs show
f(N) = 100*(number of words in the first N tokens)/N
for the original text and for the line-sorted version.
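As a rough illustration of that ratio, here is a minimal Python sketch. The function name is made up for this example and is not taken from Jose's program; it assumes tokens are simply whitespace-separated strings.

```python
def distinct_word_percentage(tokens):
    """Return [f(1), ..., f(len(tokens))], where
    f(N) = 100 * (distinct words among the first N tokens) / N."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)                      # a "word" is counted once
        curve.append(100.0 * len(seen) / n)
    return curve

# The example sentence from above: six tokens, four words.
tokens = "the man can open the can".split()
curve = distinct_word_percentage(tokens)
```

The last entry of `curve` is 100*4/6, since the six tokens contain only four words.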
> But my surprise was enormous when I looked for the lines of the
> text which corresponded the peaks. Accurately, I observed that the
> three main crests corresponded to the lines [205-220], [290-335]
> and [595-680], which corresponded with the lines that begin by the
> gallows f([205-219]), k([286-334]) and p([593-681]), respectively!
The graphs are new, but the explanation for those peaks in the
line-sorted graph may be a discovery that John Grove made several
years ago: on many lines, the first word looks like a "normal" word
with an extra gallows attached at the beginning.
IIRC, the evidence that those initial gallows are not really part of
the word includes: (1) many of those words are fairly rare, but if one
removes those "detachable gallows" (as John called them), one often
obtains a relatively common word; (2) words that start with a gallows
letter are more common at the beginning of lines than elsewhere; (3)
those words often have two gallows, which is a fairly rare feature of
About item (3): to be precise, in one transcription of the VMS there
are ~930 tokens (in 35,000, or 2.6%) that do not fit my word grammar
because they violate the three-layer (crust-mantle-core-mantle-crust)
rule. Most of these words occur only once in the text. Of those ~930
tokens, ~210 look like Grove words in that they start with a gallows,
and by removing that gallows one obtains a word that fits my word
grammar. (Most of the remaining ~720 anomalies could be pairs of words
that were run together.)
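The "Grove-like" test described above can be sketched as follows. This is only an illustration under loud assumptions: the gallows are taken to be the EVA letters t, k, p, f; `fits_grammar` is a hypothetical predicate standing in for the word grammar, which is not spelled out here; and the toy word list is a placeholder, not data from the VMS.

```python
GALLOWS = ("t", "k", "p", "f")  # assumption: EVA gallows letters

def grove_candidates(tokens, fits_grammar):
    """Return (token, stripped) pairs for tokens that fail the grammar,
    start with a gallows, and pass the grammar once that gallows is removed."""
    found = []
    for tok in tokens:
        if len(tok) > 1 and tok[0] in GALLOWS and not fits_grammar(tok):
            stripped = tok[1:]
            if fits_grammar(stripped):
                found.append((tok, stripped))
    return found

# Toy stand-in for the grammar: a small set of accepted strings (placeholder).
toy_valid = {"daiin", "chedy", "okaiin"}
fits = lambda w: w in toy_valid

tokens = ["pchedy", "daiin", "kokaiin", "tqxz"]
found = grove_candidates(tokens, fits)
```

With the toy grammar, `pchedy` and `kokaiin` are reported, while `tqxz` is not, because removing its gallows still leaves an anomalous string.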
I have not checked whether those ~210 "Grove-like" tokens occur at the
beginning of lines or not. They seem to occur at about the same rate
in most sections, but twice as often in Biological, and hardly at all
in Astro/Cosmo. None of them occur as labels.
So I guess that the peaks in your graph correspond to the lines that
start with Grove words. Since Grove words generally occur only once in
the text, it is expected that you get more distinct words in those
sections of your sorted file than elsewhere.
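To see why once-only words at line starts would produce such peaks, here is a small sketch of a per-window percentage over a line-sorted file. It is a guess at the kind of measurement involved, not a reconstruction of Jose's program; the function name, the window size, and the toy lines are all invented for illustration.

```python
def windowed_distinct_percentage(lines, window):
    """For each block of `window` consecutive lines, return
    (index of first line, 100 * distinct words / tokens in the block)."""
    result = []
    for start in range(0, len(lines), window):
        toks = [t for line in lines[start:start + window] for t in line.split()]
        if toks:
            result.append((start, 100.0 * len(set(toks)) / len(toks)))
    return result

# Toy text: the last two lines start with a one-off "gallows + word" token.
lines = sorted(["aa bb aa", "pxx aa bb", "aa bb bb", "pyy bb aa"])
ratios = windowed_distinct_percentage(lines, window=2)
```

In the toy data the second window, which collects the lines starting with the one-off tokens, has a higher percentage of distinct words than the first.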
The big question is what the Grove words mean. Apart from
cryptographic devices, they could be separators (like the reversed "P"
sign that scribes used to separate paragraphs). Or they could be tags
indicating "fields" in a "form", e.g.
    N[ame] witches weed
    G[rows] meadows, fields
    S[eason] late spring gather under full moon
    T[aste] bitter with a twist of lemon
    U[ses] cures headaches unclogs drains removes price tags
    G very rare except where hobbits meet ringwraiths
    S episode one
    U prevents premature end of plot
Or character tags, in a play or dialogue:
    D[iscipulus]: and how do we know that the moon is not made of cheese
    M[agister]: well you don't see mice howling at it do you
    D are we then to conclude that the moon is a flying sheep
    M either that or the abode of three little pigs
All the best,