Re: VMs: truncated repeating sequences
On Thursday 09 September 2004 15:07, Marke Fincher wrote:
> The next step is to see if >99% of the VMs can be created by pasting
> together decent sized chunks taken from a small set of "master sequences".
The answer is very likely to be "no" because of the word frequency
distribution.
Note that the procedure will have to fit (somehow) all the words that appear
once or twice in the entire ms. The number of those words is larger than 1%
so there is no chance that 99% of the ms. is produced with other repeated
sequences.
I just had a look and words appearing once are about 14% of the corpus.
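A figure like this can be checked with a few lines of Python. This is only a rough sketch: the whitespace tokenisation and the name `rare_token_share` are assumptions made for illustration, and a real transcription file would need its own cleaning first.

```python
from collections import Counter

def rare_token_share(text, max_count=1):
    """Fraction of tokens whose word type occurs at most max_count times."""
    tokens = text.split()  # assumption: words are separated by whitespace
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the occurrences of every word type that is rare enough.
    rare = sum(c for c in counts.values() if c <= max_count)
    return rare / len(tokens)

# Toy input, not real transcription data.
toy = "a b a c b d e a"
print(rare_token_share(toy, 1))  # words appearing once
print(rare_token_share(toy, 2))  # words appearing once or twice
```

Note that this counts tokens, not distinct word types; the share of distinct types that occur only once would be a different (and larger) number.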
It would also be useful to look at Stolfi's concordance lists to see to
what extent the repetitions are common.
I would also say that it is important to test the same algorithm with other
data (i.e. other languages and word-scrambled texts). A sample of n=1 will
not be very convincing, as we do not know how relevant the effect may be.
Worth doing, though. I would be interested to know how common this effect is
in real languages.
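A word-scrambled control could be set up roughly as follows. The coverage measure here (share of word positions lying inside a word n-gram that occurs at least twice) is only a simple stand-in chosen for illustration, not the algorithm under discussion, and the names `repeated_ngram_coverage` and `scrambled` are hypothetical.

```python
import random
from collections import Counter

def repeated_ngram_coverage(tokens, n=3):
    """Fraction of token positions covered by an n-gram occurring at least twice."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    covered = [False] * len(tokens)
    for i, gram in enumerate(grams):
        if counts[gram] >= 2:
            # Mark every position spanned by this repeated n-gram.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)

def scrambled(tokens, seed=0):
    """Word-shuffled copy: same word frequencies, word order destroyed."""
    out = list(tokens)
    random.Random(seed).shuffle(out)
    return out

# Toy input, not real transcription data.
tokens = "a b c x a b c y".split()
print(repeated_ngram_coverage(tokens))
print(repeated_ngram_coverage(scrambled(tokens)))
```

Running the same two calls on the ms. and on a few natural-language texts would show whether the original order gives noticeably more coverage than its scrambled version in each case.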
I am sure that lots of repetitions are found in Askham's herbal. If I remember
correctly, most plant descriptions start with "This herbe is called ...".
Cheers,
Gabriel