Hi all,
I replaced all ii with I and all ee with E, and now get the following peak frequencies for the character distances:
Freq.      Dist  Char. pair
0.631167   7     e-e
0.660071   6     e-e
0.0446923  6     i-i
0.0478532  7     i-i
0.0775552  7     I-I
0.121701   6     I-I
0.209564   7     E-E
0.21246    6     E-E
1.61051    7     o-o
1.70724    6     o-o
IMHO this feature suggests the following:
ii and ee are single characters (they now show the same pattern as the other chars)
6.5 is the average token length (= number of chars between two repetitions of the same character)
chars tend to occur in the same position within a token.
(The relative frequency is the number of occurrences of a pair at the given distance divided by the number of all occurrences of that distance.)
Replacing the ii/ee pairs with one character changes the proportions of the token structure. Now there is no exception in the average distance of char pairs.
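For anyone who wants to try this on their own transcription, here is a minimal Python sketch of the procedure as described above. It is not the script used for the numbers in this post. The function names merge_digraphs and same_char_freq are made up for illustration, and it assumes that the text is one plain string, that spaces count as characters, that "distance" means the difference in character positions, and that the divisor is the number of all character pairs at that distance.

```python
from collections import Counter


def merge_digraphs(text, mapping=None):
    """Replace each doubled glyph with a single placeholder character."""
    if mapping is None:
        # Assumed mapping, following the post: ii -> I, ee -> E.
        mapping = {"ii": "I", "ee": "E"}
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def same_char_freq(text, dist):
    """Relative frequency of each pair c-c at the given distance.

    Counts positions i where text[i] == text[i + dist] and divides by
    the number of all character pairs at that distance.
    """
    total = len(text) - dist
    if total <= 0:
        return {}
    counts = Counter(
        text[i] for i in range(total) if text[i] == text[i + dist]
    )
    return {c: n / total for c, n in counts.items()}


if __name__ == "__main__":
    sample = merge_digraphs("daiin cheey daiin cheey")  # toy input only
    for dist in (6, 7):
        print(dist, same_char_freq(sample, dist))
```

Whether the printed values are comparable with the table depends on the assumptions above, in particular on how spaces and the normalisation are handled.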
Claus