
Re: Labels o and q.

Great Stuff!
If 'o' is a definite article, could 'q' preceding 'o' be a plural form? That is, like nouns prefixed with La/Le versus Les. This could explain why there are more occurrences of 'qo' in tokens than in labels: you are more likely to list 'single' items, but will refer to them in the plural when talking about them in a general sense.
Initial 'c' or 'e' is possibly overridden by a 'ch' in word-initial, or perhaps syllable-initial, position.
----- Original Message -----
Sent: Sunday, March 04, 2001 10:00 AM
Subject: Labels o and q.

Hi all,
As you may remember, somebody (?) noted that there is a tendency of the labels to start with <o>.
I just had a look at the 676 labels which had no ambiguous first characters (the total number of labels according to my count, which may not be correct and has not been double-checked, is 684).
I also counted the distribution of first characters in tokens (again those with unambiguous characters) and subtracted the label counts to obtain a label-less token count.
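
The counting step described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the word lists are made-up toy strings, and the assumption that ambiguous readings are marked with '?' or '*' is mine.

```python
from collections import Counter

# Assumption: ambiguous readings are marked with one of these characters.
AMBIGUOUS = "?*"

def first_char_counts(words):
    """Count first characters, skipping empty words and ambiguous first characters."""
    return Counter(w[0] for w in words if w and w[0] not in AMBIGUOUS)

def percentages(counts):
    """Convert raw counts to percentages of the total."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}

# Hypothetical toy data; the labels are assumed to be included among the tokens.
all_tokens = ["okedy", "qokedy", "chedy", "otol", "daiin", "otol", "?aiin"]
labels = ["otol", "okedy"]

token_counts = first_char_counts(all_tokens)
label_counts = first_char_counts(labels)

# Counter subtraction keeps only positive counts, which is what we want
# as long as every label also appears in the token list.
non_label_counts = token_counts - label_counts

print(percentages(non_label_counts))
print(percentages(label_counts))
```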

The distribution (in %) of 1st characters is as follows:
(sorted descending by the frequency of tokens).

Char  Tokens  Labels
o     21.92   67.01
c     18.22    6.66
q     14.13    0.44
s     11.99    5.92
d      9.59    7.25
a      5.78    2.37
y      4.93    6.80
l      3.67    0.44
k      3.40    1.18
t      2.71    0.89
p      1.45    0.15
r      1.27    0.15
e      0.40    0.74
f      0.32    0.00
i      0.07    0.00
x      0.05    0.00
g      0.04    0.00
v      0.03    0.00
m      0.03    0.00
n      0.01    0.00
j      0.01    0.00
u      0.00    0.00
z      0.00    0.00

The two largest differences seem to be the excess of <o> and the lack of <q> as initial characters.
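
As a quick check of that statement, the differences (labels minus tokens) can be ranked directly from the table. A minimal sketch, using only the rows with non-zero label percentages; the variable names are just for illustration.

```python
# (token %, label %) by first character, copied from the table above.
table = {
    "o": (21.92, 67.01), "c": (18.22, 6.66), "q": (14.13, 0.44),
    "s": (11.99, 5.92), "d": (9.59, 7.25), "a": (5.78, 2.37),
    "y": (4.93, 6.80), "l": (3.67, 0.44), "k": (3.40, 1.18),
    "t": (2.71, 0.89), "p": (1.45, 0.15), "r": (1.27, 0.15),
    "e": (0.40, 0.74),
}

# Positive values mean the character is over-represented in labels.
diff = {ch: labels - tokens for ch, (tokens, labels) in table.items()}

# Rank by absolute size of the difference, largest first.
ranked = sorted(diff, key=lambda ch: abs(diff[ch]), reverse=True)

for ch in ranked:
    print(f"{ch} {diff[ch]:+.2f}")
```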

I remember that it has been suggested before that <o> may be an article preceding a noun. This could well be the case.

I also remember that <q> has been suggested to act as "and" or "&" joining from the previous line. The lack of <q> in the labels (only 3 have it) seems to fit nicely with that idea too.

Also <c> appears less than expected. I had a look at these labels: they are all <ch>+something, except 2 labels which start with <cph>.

I am not sure whether the following means anything, but if we leave out those labels starting with <o> and <q>, then the distribution seems to get a bit closer to that of the tokens. Note that by doing this I ignored 67% of the labels...

Char  Tokens  Labels
c     28.48   20.45
s     18.75   18.18
d     14.99   22.27
a      9.03    7.27
y      7.71   20.91
l      5.73    1.36
k      5.32    3.64
t      4.23    2.73
p      2.27    0.45
r      1.98    0.45
e      0.63    2.27

Still, there are too many <y>, too few <l>, etc.
A few cells are empty or have fewer than 5 items in them, so a chi-squared test cannot be performed with confidence...
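
To see which cells cause the trouble, one can compute the counts expected under the token distribution and flag those below the usual rule-of-thumb minimum of 5. This is a rough sketch only. The figure of 220 remaining labels is an inference from the percentages (676 labels minus those starting with <o> or <q>), not a number stated above, and the token percentages are renormalised over the 11 characters listed so that the expected counts sum to 220.

```python
# Percentages from the second table (labels starting with <o>/<q> left out).
token_pct = {"c": 28.48, "s": 18.75, "d": 14.99, "a": 9.03, "y": 7.71, "l": 5.73,
             "k": 5.32, "t": 4.23, "p": 2.27, "r": 1.98, "e": 0.63}
label_pct = {"c": 20.45, "s": 18.18, "d": 22.27, "a": 7.27, "y": 20.91, "l": 1.36,
             "k": 3.64, "t": 2.73, "p": 0.45, "r": 0.45, "e": 2.27}

N_LABELS = 220  # assumption: inferred from the percentages, not stated in the post

def expected_counts(pct, n):
    """Expected label counts if labels followed the (renormalised) token distribution."""
    total = sum(pct.values())
    return {ch: n * p / total for ch, p in pct.items()}

def small_cells(expected, minimum=5.0):
    """Characters whose expected count falls below the rule-of-thumb minimum."""
    return sorted(ch for ch, e in expected.items() if e < minimum)

def chi_squared(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over all cells."""
    return sum((observed[ch] - expected[ch]) ** 2 / expected[ch] for ch in expected)

expected = expected_counts(token_pct, N_LABELS)
# Observed counts reconstructed from the rounded percentages (approximate).
observed = {ch: N_LABELS * p / 100.0 for ch, p in label_pct.items()}

print("cells with expected count < 5:", small_cells(expected))
print("chi-squared statistic:", round(chi_squared(observed, expected), 2))
```

Any statistic obtained this way should be read with caution while the small cells are left in; merging them into one "other" cell would be one way around that.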

What I should have done (and will do later) is to count the labels by the 2nd letter if they start with <o> or <q>, and look at the result (i.e., do they produce valid Voynich words?), but this brings another problem, as some labels have ambiguous 2nd characters.
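
A minimal sketch of what that follow-up count might look like. The labels and vocabulary below are made-up toy strings, the '?' and '*' markers for ambiguous characters are an assumption, and "valid" is taken here simply to mean "the remainder occurs as a token elsewhere".

```python
from collections import Counter

# Assumption: ambiguous readings are marked with one of these characters.
AMBIGUOUS = "?*"

def second_letter_counts(labels):
    """Count 2nd letters of labels starting with o or q, skipping ambiguous ones."""
    return Counter(
        w[1] for w in labels
        if len(w) > 1 and w[0] in "oq" and w[1] not in AMBIGUOUS
    )

def remainder_attested(labels, vocabulary):
    """For each label starting with o or q, is the rest of the word a known token?"""
    return {
        w: w[1:] in vocabulary
        for w in labels
        if len(w) > 1 and w[0] in "oq"
    }

# Hypothetical toy data.
labels = ["otol", "okedy", "qokal", "chedy", "o?aiin"]
vocabulary = {"tol", "okal", "chedy", "daiin"}

print(second_letter_counts(labels))
print(remainder_attested(labels, vocabulary))
```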
Anyway I thought that what I found so far would be of interest.
All comments are welcome.