
Re: VMs: provable?

On Wed, 29 Sep 2004, Jorge Stolfi wrote:
> Here is a quick tabulation for my collection of sample texts, most of
> them truncated to match the number of tokens in the VMS:

Very interesting!  I suspect "simpler" text, e.g., American newspaper
text (supposedly aimed at readers with about a 6th-grade education),
would have fewer unique words, modulo any tendency toward hapax legomena
in newspaper text.  The Gospels might be "simpler" than the Pentateuch,
as they were written in a second language (for many of the writers -
even thinking historically as opposed to fundamentally) for
instructional use, whereas I think the Pentateuch is a more literary
text, and this may be reflected even in translations.
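
For anyone who wants to try this kind of comparison on their own samples,
here is a minimal Python sketch of the count.  It is not the procedure
behind the quoted tabulation: the tokenizer (lowercased runs of letters
and apostrophes) and the names word_stats and max_tokens are my own
assumptions.  Since "unique words" could mean either distinct word types
or words occurring only once, it reports both.

```python
import re
from collections import Counter

def word_stats(text, max_tokens=None):
    """Return (token count, distinct types, hapax legomena) for a text."""
    # Crude tokenizer (an assumption): lowercased runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Truncate so that samples of different lengths can be compared.
    if max_tokens is not None:
        tokens = tokens[:max_tokens]
    counts = Counter(tokens)
    # Hapax legomena are the types that occur exactly once.
    hapax = sum(1 for c in counts.values() if c == 1)
    return len(tokens), len(counts), hapax

sample = "the cat sat on the mat and the dog sat on the log"
tokens, types, hapax = word_stats(sample)
print(f"{tokens} tokens, {types} types, {hapax} hapax")
```

Truncating every sample to the same number of tokens matters because the
share of distinct words generally drops as a text gets longer.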

If anyone is interested I could provide Omaha or Dakota text.  These are
not "polysynthetic," but do tend to be fairly highly synthetic.
Conventions on word division are somewhat flexible.  People tend to write
bundles of enclitics separately after the main word; the bundles are more
or less conventional, and divided consistently but arbitrarily.

> Not surprisingly, sample envt/wow (English text where each distinct
> word was replaced by a distinct combination of one or two Vietnamese
> words) has the same low counts as Vietnamese; whereas envg/wow (the
> same English text encrypted with a Vigenère cipher) holds the record
> for the percentage of unique words (70%)

I suspect you'd get results similar to Vietnamese if you divided English
text into syllables, too.  It might be easier to do this mechanically with
a Romance language (other than maybe French), though.
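
On the Vigenère figure quoted above: one plausible reason for the high
share of unique words is that the same plaintext word encrypts
differently depending on where it falls in the key cycle.  The Python
sketch below illustrates that effect on a toy sentence.  The key
"voynich", the helper names, and the choice to let the key run on across
word breaks are all assumptions of mine, not details of how envg/wow was
produced.

```python
def vigenere(text, key):
    """Encrypt a-z with a Vigenère key; the key keeps advancing across words."""
    out, i = [], 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            shift = ord(key[i % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1  # only letters consume key positions
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)

def distinct_fraction(text):
    """Fraction of whitespace-separated words that are distinct types."""
    words = text.split()
    return len(set(words)) / len(words)

plain = "the cat sat on the mat and the dog sat on the log"
cipher = vigenere(plain, "voynich")  # hypothetical key
print(distinct_fraction(plain), distinct_fraction(cipher))
```

In this toy sentence 8 of the 13 plaintext words are distinct, while the
four occurrences of "the" each come out as a different ciphertext word.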
