
Re: VMs: truncated repeating sequences



On Thursday 09 September 2004 15:07, Marke Fincher wrote:
> The next step is to see if >99% of the VMs can be created by pasting
> together decent sized chunks taken from a small set of "master sequences". 

The answer is very likely to be "no", because of the word frequency 
distribution.
Note that the procedure will have to fit (somehow) all the words that appear 
only once or twice in the entire ms. Those words account for more than 1% of 
the text, so there is no chance that 99% of the ms. can be produced from 
repeated sequences alone. 

I just had a look and words appearing once are about 14% of the corpus.
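
For anyone who wants to repeat that check on their own transcription, here is a 
minimal sketch of how such a figure could be computed. It assumes the text is 
already split into a list of word tokens, and it measures the share of tokens 
(not of distinct word types) whose word occurs at most max_count times. The 
function name and the sample words are only illustrative and are not taken 
from any particular transcription file.

```python
from collections import Counter

def rare_word_share(tokens, max_count=1):
    """Fraction of tokens whose word type occurs at most max_count times."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the occurrences of every word type that is rare enough.
    rare_tokens = sum(c for c in counts.values() if c <= max_count)
    return rare_tokens / len(tokens)

# Illustrative tokens only (an assumption, not real corpus data).
sample = "daiin chedy qokeedy daiin shol chedy otaiin".split()
print(rare_word_share(sample))      # words appearing once
print(rare_word_share(sample, 2))   # words appearing once or twice
```

With max_count=2 the same function gives the "once or twice" share that the 
argument above relies on.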

It would also be useful to look at Stolfi's concordance lists to see how 
common the repetitions are. 

I would also say that it is important to test the same algorithm on other 
data (i.e. other languages and word-scrambled texts). A sample of n=1 will 
not be very convincing, as we do not know how relevant the effect may be. 
Worth doing, though. I would be interested to know how common this effect is 
in real languages.
I am sure that lots of repetitions can be found in Askham's herbal. If I 
remember correctly, most plant descriptions start "This herbe is called ...". 
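
A rough sketch of the kind of control I have in mind follows. It is not the 
algorithm proposed in the quoted message, which I have not seen in code; as a 
simple stand-in it measures what fraction of the tokens lie inside some word 
n-gram that occurs at least twice, and it builds a word-scrambled copy of the 
same text for comparison. The function names, the choice n=3 and the sample 
sentence are all assumptions made for illustration.

```python
import random
from collections import Counter

def repeated_ngram_coverage(tokens, n=3):
    """Fraction of tokens lying inside some n-gram that occurs at least twice."""
    if not tokens:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    covered = [False] * len(tokens)
    for i, gram in enumerate(grams):
        if counts[gram] > 1:
            # Mark every token position spanned by this repeated n-gram.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)

def scrambled(tokens, seed=0):
    """Copy of the tokens in random order (same words, word order destroyed)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Illustrative text only, modelled on the herbal opening quoted above.
text = "this herbe is called rose this herbe is called mint".split()
print(repeated_ngram_coverage(text))
print(repeated_ngram_coverage(scrambled(text)))
```

Running the same two calls on the ms., on texts in a few known languages and 
on their scrambled versions would show whether the coverage seen in the ms. 
is unusual or simply what the word frequencies alone would produce.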

Cheers,

Gabriel



