Re: Counting the Gallow Bits
I did this, but the resulting numbers were quite similar, differing only in
the first decimal place.
I had anticipated this behaviour, because in lines shorter than 7 tokens
the distribution is a good mixture of 0/1 sequences (in fact all values
from 0x00 to 0xff occur, though some are of course more frequent). In
longer lines, however, the tendency of 0s or 1s to cluster becomes significant.
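One simple way to put a number on that clustering is the mean run length of a line's bit string. The following is only a rough sketch: it assumes each line has already been reduced to a string of '0' and '1' characters, one per token, and the function names are made up for illustration. For independent, equally likely bits the mean run length of a long sequence would be close to 2, so clearly larger values would point to clustering.

```python
from itertools import groupby


def run_lengths(bits):
    """Lengths of maximal runs of identical symbols in a '0'/'1' string."""
    return [len(list(group)) for _, group in groupby(bits)]


def mean_run_length(bits):
    """Average run length; 0.0 for an empty string."""
    runs = run_lengths(bits)
    return len(bits) / len(runs) if runs else 0.0
```

Comparing the average of `mean_run_length` over short lines with that over long lines would show whether the difference is as large as it looks by eye.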
Today I had the idea of making a bitmap of the VMS:
1. For every character in the VMS I compute its frequency and assign it a
colour depending on that frequency: from red (low frequency) to blue (high
frequency). Then I map every page to this colouring scheme, using e.g. a
10 by 10 grid of pages. For the whole VMS I would get around 2 bitmaps. Maybe
the image could reveal some structure (Currier A and B or something like
that).
2. Then do the same for character pairs or triplets.
3. Finally, do the same for tokens.
Do you think this is futile?
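A minimal sketch of step 1 might look like the following. It assumes the transcription is available as one plain string per page, that whitespace only separates tokens, and that a linear red-to-blue blend is acceptable (a logarithmic scale may spread the rare characters better). All function names are hypothetical, and the output is a plain-text PPM file so that no imaging library is needed. Tiling the pages into a 10 by 10 grid is left out.

```python
from collections import Counter


def char_frequencies(pages):
    """Count every non-whitespace character over all pages."""
    return Counter(ch for page in pages for ch in page if not ch.isspace())


def frequency_color(count, max_count):
    """Linear blend: count 0 gives pure red, max_count gives pure blue."""
    t = count / max_count if max_count else 0.0
    return (round(255 * (1 - t)), 0, round(255 * t))


def page_to_pixels(page, counts, width):
    """Turn one page into rows of RGB pixels, one pixel per character."""
    max_count = max(counts.values(), default=0)
    pixels = [frequency_color(counts[ch], max_count)
              for ch in page if not ch.isspace()]
    # Pad the last row with black so that every row has the same width.
    while len(pixels) % width:
        pixels.append((0, 0, 0))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def write_ppm(path, rows):
    """Write rows of RGB pixels as a plain-text (P3) PPM image."""
    if not rows:
        raise ValueError("no pixel rows to write")
    with open(path, "w", encoding="ascii") as f:
        f.write(f"P3\n{len(rows[0])} {len(rows)}\n255\n")
        for row in rows:
            f.write(" ".join(f"{r} {g} {b}" for r, g, b in row) + "\n")
```

Steps 2 and 3 would only need a different counting unit: character pairs, triplets or whole tokens in place of single characters.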
Cheers
Claus
-----Original Message-----
From: Gabriel Landini [mailto:G.Landini@xxxxxxxxxx]
Sent: Wednesday, 13 September 2000 09:58
To: Claus Anders
Subject: Re: Counting the Gallow Bits
On 12 Sep 2000, at 20:18, Claus Anders wrote:
> yes I counted the labels too, but they're even long lines (11 tokens
> or less) with 100% 1 coverage.
Well, those statistics will mix the rules of word construction with
those of grammar.
I would count only those in text lines but not labels since a sequence
of labels may not have any grammatical structure.
Cheers,
Gabriel