More on label anomalies
Hi again,
Following on from my previous message, there is
some suspicion that the labels could be names,
adjectives or nouns.
If <q> is something like "and" and <o> is
similar to "the", then the next character after
<o>, <q> or <qo> should be the 1st proper
character of the label.
I had a look at the distribution of characters
by considering the 2nd character if the label
starts with <q>|<o> or the 3rd if it starts with
<qo>.
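
For anyone wanting to repeat the count, here is a minimal Python sketch of the procedure. It assumes the labels are plain strings in a transcription where <q> and <o> are single characters, that only one leading <qo>, <q> or <o> is skipped, and that labels without such a prefix contribute their 1st character. The function names and the sample labels are made up for illustration; they are not the data behind the figures in this message.

```python
from collections import Counter

def first_proper_char(label):
    """Return the character after a leading <qo>, <q> or <o>, else the 1st one."""
    if label.startswith("qo"):
        skip = 2          # 3rd character is the first proper one
    elif label.startswith(("q", "o")):
        skip = 1          # 2nd character is the first proper one
    else:
        skip = 0          # assumption: no prefix, use the 1st character
    return label[skip] if len(label) > skip else None

def char_distribution(labels):
    """Percentage of labels per first proper character, most frequent first."""
    counts = Counter(c for c in map(first_proper_char, labels) if c is not None)
    total = sum(counts.values())
    return {c: round(100.0 * n / total, 2) for c, n in counts.most_common()}

# Hypothetical labels, invented for this example only
sample_labels = ["otaly", "okary", "qokal", "chedy", "oteos", "dar"]
dist = char_distribution(sample_labels)
for char, pct in dist.items():
    print(char, pct)
```

Restricting the input list to labels that start with <q>, <o> or <qo> before calling char_distribution gives the kind of subset table discussed further on.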
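I got the following surprise (all values in %; the
last column is the first proper character of the
labels, i.e. after skipping a leading <q>, <o> or <qo>):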
char  words  tokens  labels  labels_without_q|o|qo
o     18.59   21.92   67.01   0.00
c     17.92   18.22    6.66   8.68
q     11.27   14.13    0.44   0.00
s     10.84   11.99    5.92   7.65
d      7.36    9.59    7.25   8.68
a      4.24    5.78    2.37   3.09
y      7.44    4.93    6.80   7.06
l      4.76    3.67    0.44   4.26
k      4.46    3.40    1.18  23.53
t      4.80    2.71    0.89  24.41
p      3.93    1.45    0.15   4.71
r      1.43    1.27    0.15   2.65
e      1.29    0.40    0.74   3.38
f      1.07    0.32    0.00   1.47
i      0.27    0.07    0.00   0.44
x      0.13    0.05    0.00   0.00
g      0.03    0.04    0.00   0.00
v      0.07    0.03    0.00   0.00
m      0.03    0.03    0.00   0.00
n      0.01    0.01    0.00   0.00
j      0.04    0.01    0.00   0.00
u      0.01    0.00    0.00   0.00
z      0.01    0.00    0.00   0.00
So almost half the labels (47.94%, if <q> and <o>
are not strictly part of the word) start with <k>
or <t>. The two percentages (23.53 and 24.41) are
so close...
Is <k>=<t>?
If so, why do half the labels start with <k>|<t>?
Taking only the labels that start with <q|o|qo>
(458 of them), the 2nd or 3rd character has the
following distribution (in %):
t 34.72
k 32.97
p 6.77
l 5.68
e 3.93
r 3.71
c 2.84
s 2.40
d 2.18
f 2.18
a 1.09
i 0.66
o 0.44
y 0.44
Again there is a <k>|<t> excess (67.69% combined,
against 47.94% over all labels), so the labels
starting with <q|o|qo> seem to contribute more
than their share of <k> and <t>.
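
As a rough check on how close <t> and <k> really are in this subset, the percentages can be turned back into counts using the stated total of 458 labels. The sketch below does that and then applies an exact two-sided binomial test for "equally frequent". The test is an added illustration, not part of the original tally, and equal frequency would not by itself show that <k> and <t> are the same character.

```python
from math import comb

N_PREFIXED = 458  # labels starting with <q|o|qo>, as stated above

# Percentages from the table for the <q|o|qo> subset
pct = {"t": 34.72, "k": 32.97, "p": 6.77, "l": 5.68, "e": 3.93,
       "r": 3.71, "c": 2.84, "s": 2.40, "d": 2.18, "f": 2.18,
       "a": 1.09, "i": 0.66, "o": 0.44, "y": 0.44}

# Recover integer counts from the rounded percentages
counts = {c: round(p * N_PREFIXED / 100) for c, p in pct.items()}
total = sum(counts.values())

# Combined share of <k> and <t> in the subset
kt_share = round(100 * (counts["k"] + counts["t"]) / total, 2)

def two_sided_binom_p(x, n):
    """Exact two-sided p-value for x successes in n trials with p = 0.5."""
    hi = max(x, n - x)
    tail = sum(comb(n, i) for i in range(hi, n + 1)) / 2**n
    return min(1.0, 2 * tail)

# Among labels whose first proper character is <k> or <t>,
# is the split compatible with 50/50?
p_value = two_sided_binom_p(counts["t"], counts["t"] + counts["k"])

print(counts["t"], counts["k"], total, kt_share, round(p_value, 3))
```

The recovered counts add up to 458 again, which suggests the rounding has not lost anything. A large p-value here only says that the <t>/<k> split is compatible with a coin toss.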
All comments welcome.
Cheers,
Gabriel