Re: VMs: truncated repeating sequences

On Thu, 9 Sep 2004, Gabriel Landini wrote:
> The answer is very likely to be "no" because of the word frequency
> distribution.
> Note that the procedure will have to fit (somehow) all the words that appear
> once or twice in the entire ms. The number of those words is larger than 1%
> so there is no chance that 99% of the ms. is produced with other repeated
> sequences.
> I just had a look and words appearing once are about 14% of the corpus.

I think Marke is compressing spaces out in his analysis, so one would need
to make sure that unique words didn't arise from unique spacing.
However, I was wondering the same thing.  Having to account for the
once-only words is analogous to having to account for the labels.  (One
disadvantage of approaching the VMs. via transcription files is that you
tend to lose the perception of the labels as such.)
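
For anyone who wants to check the once-only figure, here is a minimal
Python sketch. It assumes a transcription already reduced to plain
whitespace-separated words; the function name and the toy sample string
are my own inventions, not taken from any transcription file. It reports
once-only words both as a share of all word tokens and as a share of
distinct words, since "14% of the corpus" could be read either way.

```python
from collections import Counter

def hapax_stats(text):
    """Return (share of tokens, share of distinct words) that occur once.

    Assumes words are separated by whitespace; how uncertain spaces
    are resolved in the transcription will change the result.
    """
    words = text.split()
    if not words:
        return 0.0, 0.0
    counts = Counter(words)
    # Words that appear exactly once in the whole text.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(words), hapaxes / len(counts)

# Toy sample only, made up for illustration.
sample = "daiin ol daiin chedy qokeedy ol daiin shedy"
token_share, type_share = hapax_stats(sample)
print(f"once-only words: {token_share:.1%} of tokens, "
      f"{type_share:.1%} of distinct words")
```

Running this on two versions of the same text that differ only in how
doubtful spaces are resolved would give a rough idea of how much of the
once-only count depends on spacing.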

