
More on label anomalies



Hi again,
Following on from my previous message, there is 
some suspicion that the labels could be names, 
adjectives or nouns.
If <q> is something like "and" and <o> is 
similar to "the", then the next character after 
<o>, <q> or <qo> should be the 1st proper 
character of the label.
I had a look at the distribution of characters 
by considering the 2nd character if the label 
starts with <q>|<o> or the 3rd if it starts with 
<qo>.
I got the following surprise (in %):

char words  tokens  Labels  labels_without_q|o|qo
o    18.59   21.92  67.01   0.00
c    17.92   18.22   6.66   8.68
q    11.27   14.13   0.44   0.00
s    10.84   11.99   5.92   7.65
d     7.36    9.59   7.25   8.68
a     4.24    5.78   2.37   3.09
y     7.44    4.93   6.80   7.06
l     4.76    3.67   0.44   4.26
k     4.46    3.40   1.18  23.53
t     4.80    2.71   0.89  24.41
p     3.93    1.45   0.15   4.71
r     1.43    1.27   0.15   2.65
e     1.29    0.40   0.74   3.38
f     1.07    0.32   0.00   1.47
i     0.27    0.07   0.00   0.44
x     0.13    0.05   0.00   0.00
g     0.03    0.04   0.00   0.00
v     0.07    0.03   0.00   0.00
m     0.03    0.03   0.00   0.00
n     0.01    0.01   0.00   0.00
j     0.04    0.01   0.00   0.00
u     0.01    0.00   0.00   0.00
z     0.01    0.00   0.00   0.00
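
For anyone who wants to repeat the count, the procedure described 
above (take the 2nd character if the label starts with <q> or <o>, 
the 3rd if it starts with <qo>) could be sketched roughly as below. 
This is only a minimal sketch, not the script used for the tables: 
it assumes one character per glyph in the transcription, and the 
function names and the sample labels are made up for illustration.

```python
from collections import Counter


def core_initial(label):
    """Return the character right after a leading <qo>, <q> or <o>.

    Returns None if the label has no such prefix, or if nothing
    follows the prefix. Assumes one character per glyph.
    """
    if label.startswith("qo"):
        skip = 2          # 3rd character is the first "proper" one
    elif label.startswith(("q", "o")):
        skip = 1          # 2nd character is the first "proper" one
    else:
        return None
    return label[skip] if len(label) > skip else None


def percent_distribution(chars):
    """Map each character to its share of the total, in %."""
    counts = Counter(chars)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: round(100.0 * n / total, 2) for c, n in counts.most_common()}


def prefixed_distribution(labels):
    """Distribution of the first character after <q|o|qo>, in %."""
    initials = (core_initial(label) for label in labels)
    return percent_distribution(c for c in initials if c is not None)


# Hypothetical labels, invented for this example (not real data).
sample = ["otedy", "okal", "qokar", "oteey", "daiin", "chedy"]
print(prefixed_distribution(sample))
```

Labels without a <q|o|qo> prefix are simply skipped here; counting 
their first character instead would give the last column of the 
table above.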

So almost half the labels (if <q> and <o> are 
not strictly part of the word) start with <k> or 
<t> (23.53 + 24.41 = 47.94%). Those two 
percentages are so close...
Is <k>=<t>?
If so, why do half the labels start with <k>|<t>?

Taking only the labels that start with <q|o|qo> 
(458 of them), the 2nd or 3rd character has the 
following distribution (in %):

t  34.72
k  32.97
p  6.77
l  5.68
e  3.93
r  3.71
c  2.84
s  2.40
d  2.18
f  2.18
a  1.09
i  0.66
o  0.44
y  0.44

Again there is a <k>/<t> excess, and it seems to 
be even stronger (34.72 + 32.97 = 67.69%) among 
the labels starting with <q|o|qo>.
All comments welcome.
Cheers,

Gabriel