
Labels o and q.



Hi all,
As you may remember, somebody (?) noted that the labels tend to start with <o>.
I just had a look at the 676 labels whose first character is unambiguous (the total number of labels according to my count, which may not be correct and has not been double-checked, is 684).
I also counted the distribution of first characters in all tokens (again only those with an unambiguous first character) and subtracted the label counts to obtain the counts for tokens that are not labels.
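
A minimal sketch of this counting step, for anyone who wants to repeat it. It assumes that each word is a plain string, that an ambiguous character is written as "?" (this depends on the transcription used), and that the labels are also contained in the full token list. The word lists are toy data, not taken from any transcription.

```python
from collections import Counter

def first_char_counts(words, ambiguous="?"):
    """Count first characters, skipping empty words and ambiguous first characters."""
    return Counter(w[0] for w in words if w and w[0] != ambiguous)

def percentages(counts):
    """Convert raw counts to percentages of the total."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}

# Toy data (hypothetical); the labels are assumed to be part of all_tokens.
all_tokens = ["okedy", "chol", "qokeey", "otol", "daiin", "okal", "?aiin"]
labels = ["otol", "okal"]

token_counts = first_char_counts(all_tokens)
label_counts = first_char_counts(labels)

# Counter subtraction keeps only positive counts, giving the non-label tokens.
non_label_counts = token_counts - label_counts

print(percentages(non_label_counts))
print(percentages(label_counts))
```

Subtracting the counters only makes sense if the labels really are included in the token list; otherwise the two counts can be compared directly.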

The distribution (in %) of first characters is as follows, sorted in descending order of token frequency:

Char   Tokens   Labels
o       21.92    67.01
c       18.22     6.66
q       14.13     0.44
s       11.99     5.92
d        9.59     7.25
a        5.78     2.37
y        4.93     6.80
l        3.67     0.44
k        3.40     1.18
t        2.71     0.89
p        1.45     0.15
r        1.27     0.15
e        0.40     0.74
f        0.32     0.00
i        0.07     0.00
x        0.05     0.00
g        0.04     0.00
v        0.03     0.00
m        0.03     0.00
n        0.01     0.00
j        0.01     0.00
u        0.00     0.00
z        0.00     0.00

The two largest differences seem to be the excess of <o> and the lack of <q> as initial characters.

I remember that it has been suggested before that <o> may be an article preceding a noun. This could well be the case.

I also remember that <q> has been suggested to act as "and" or "&", joining on from the previous line. The lack of <q> in the labels (only 3 start with it) seems to fit nicely with that idea too.

Also, <c> appears less often than expected. I had a look at the labels starting with <c>: they are all <ch>+something, except 2 labels which start with <cph>.

I am not sure whether the following means anything, but if we leave out the labels (and tokens) starting with <o> and <q>, then the label distribution seems to get a bit closer to that of the tokens. Note that by doing this I ignored 67% of the labels...


Char   Tokens   Labels
c       28.48    20.45
s       18.75    18.18
d       14.99    22.27
a        9.03     7.27
y        7.71    20.91
l        5.73     1.36
k        5.32     3.64
t        4.23     2.73
p        2.27     0.45
r        1.98     0.45
e        0.63     2.27
...
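
The rescaling behind this second table can be sketched as below. This assumes the percentages were simply renormalised after dropping <o> and <q>. It works from the rounded percentages of the first table, so the results may differ from the table in the second decimal place (the table was presumably computed from raw counts). The function name is hypothetical.

```python
def rescale(pct, excluded_pct):
    """Re-express a percentage relative to what remains after excluding some categories."""
    return 100.0 * pct / (100.0 - excluded_pct)

# <c> among tokens, after removing <o> (21.92%) and <q> (14.13%)
tokens_c = rescale(18.22, 21.92 + 14.13)

# <c> among labels, after removing <o> (67.01%) and <q> (0.44%)
labels_c = rescale(6.66, 67.01 + 0.44)

print(round(tokens_c, 2), round(labels_c, 2))
```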

Still, there are too many <y>, too few <l>, etc.
A few cells are empty or contain fewer than 5 items, so a chi-squared test cannot be performed with confidence...
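
One common workaround for the small cells is to pool them into a single "other" category before computing the statistic. The sketch below does this in plain Python. It treats the token shares as fixed expected proportions (a goodness-of-fit test), which is an assumption, and all numbers in it are made up for illustration; they are not the counts from this post.

```python
def pool_small_cells(observed, expected, min_expected=5.0):
    """Merge categories whose expected count is below min_expected into 'other'."""
    obs_out, exp_out = {}, {}
    for key, exp in expected.items():
        target = key if exp >= min_expected else "other"
        obs_out[target] = obs_out.get(target, 0) + observed.get(key, 0)
        exp_out[target] = exp_out.get(target, 0.0) + exp
    return obs_out, exp_out

def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

# Made-up label counts and token shares (hypothetical, for illustration only).
observed = {"c": 30, "s": 50, "d": 15, "p": 3, "r": 2}
token_share = {"c": 0.40, "s": 0.40, "d": 0.14, "p": 0.04, "r": 0.02}

n_labels = sum(observed.values())
expected = {ch: n_labels * share for ch, share in token_share.items()}

obs_pooled, exp_pooled = pool_small_cells(observed, expected)
statistic = chi_squared(obs_pooled, exp_pooled)
degrees_of_freedom = len(exp_pooled) - 1

print(obs_pooled, round(statistic, 3), degrees_of_freedom)
```

The statistic then has to be compared with a chi-squared table for the given degrees of freedom. Note that the pooled "other" cell can itself still fall below 5, in which case it needs to be merged with a neighbouring category.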

What I should have done (and will do later) is to count the labels starting with <o> or <q> by their 2nd letter and look at the result (i.e. do they produce valid Voynich words once the first letter is removed?), but this brings another problem, as some labels have an ambiguous 2nd character.
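
A rough sketch of how that second-letter count could be done, assuming ambiguous characters are written as "?" and that a set of known words is available from the running text. All names and the word lists here are hypothetical toy data.

```python
from collections import Counter

def strip_prefix(label, prefixes=("o", "q")):
    """Return the label without its first character if it starts with a prefix, else None."""
    if len(label) > 1 and label[0] in prefixes:
        return label[1:]
    return None

def second_letter_summary(labels, known_words, ambiguous="?"):
    """Count 2nd letters of <o>/<q> labels and how many remainders are known words."""
    second = Counter()
    valid = 0
    skipped = 0
    for label in labels:
        rest = strip_prefix(label)
        if rest is None:
            continue  # label does not start with <o> or <q>
        if rest[0] == ambiguous:
            skipped += 1  # ambiguous 2nd character, cannot be counted
            continue
        second[rest[0]] += 1
        if rest in known_words:
            valid += 1
    return second, valid, skipped

# Toy data (hypothetical).
labels = ["otol", "okal", "o?al", "qokal", "daiin"]
known_words = {"tol", "kal", "okal"}

second, valid, skipped = second_letter_summary(labels, known_words)
print(second, valid, skipped)
```

Labels with an ambiguous 2nd character are counted separately rather than dropped silently, so it stays visible how much of the sample they make up.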
Anyway I thought that what I found so far would be of interest.
All comments are welcome.

Regards,
Gabriel