RE: VMs: provable?
Oops - I should have read ahead. But this is kind of what I was
expecting. We think in terms of words as the abstract stem or citation
form (dictionary entry), but in a text's language(s), the actual tokens in
which the word can occur range over the inflected forms of the word, which
may include several different cases and/or numbers (for nouns and
adjectives) or moods, tenses, persons, and numbers (for verbs). I'm using
typical European categories here. So then obviously a more highly
inflected language will have a greater variety of potential forms
from which to draw the tokens in which a particular word can occur. So
it's more likely that words used in a text of a more highly inflected
language will occur in a unique inflectional form, even if the word itself
occurs in other forms in the text.
For example, the Latin word stella can occur as unique tokens drawn from
the set of forms stella, stellae, stellam, stellarum, stellis, stellas
(suppressing variants for vowel length) while English star has only star
and stars. What holds here for nominal paradigms holds in spades for
verbal ones (between English, even conservative KJE, and Latin or French
or German). Comparing KJE with modern NE, add eatest and substitute eateth
for eats in the set eat, eats, ate, eaten, eating, at which point we've
reached the number of forms in a particular tense in a typical inflected
European language (3 persons x 2 numbers).
If you think of the different number of tokens appearing in "Puer ille
puerum illum videbat." vs. "The boy saw the boy." you get the drift.
(Note - I had to wing it on the Latin, and I know I cheated by using
demonstratives in lieu of the article.)
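
To make the comparison concrete, here is a rough Python sketch that counts the tokens in the two example sentences. The helper name token_counts and the choice to lowercase and drop punctuation are my own assumptions for illustration, not anything specified in the thread.

```python
import re
from collections import Counter

def token_counts(sentence):
    """Lowercase the sentence, keep only runs of letters, and count each token."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return Counter(tokens)

latin = token_counts("Puer ille puerum illum videbat.")
english = token_counts("The boy saw the boy.")

for name, counts in (("Latin", latin), ("English", english)):
    total = sum(counts.values())                         # running tokens
    once = sum(1 for n in counts.values() if n == 1)     # tokens seen exactly once
    print(f"{name}: {total} tokens, {len(counts)} distinct, {once} occurring once")
```

Both sentences have five tokens, but every Latin token is a different form, while the English sentence repeats "the" and "boy" and so has only three distinct tokens, one of which occurs once.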
John E. Koontz
http://spot.colorado.edu/~koontz
On Wed, 29 Sep 2004, Brian Tawney wrote:
> Voynich Manuscript: 8453 distinct words, of which 71% occur only once
> Talmud (no vowels): 18801 distinct words, of which 61% occur only once
> Vulgate Bible: 41849 distinct words, of which 47.52% occur only once
> Qur'an (no vowels): 14769 distinct words, of which 58.28% occur only once
>
> In each case I treated a "word" as being a space-delimited token, so Arabic
> "bismillah" is treated as one word (bsm'llh) instead of three (b-ismi-llah),
> and Hebrew "et-ha-aretz" is treated as one word instead of three, and so on.
>
> The VM is slightly high in terms of unique words...but not really so
> extraordinarily high, especially when compared to inflected/affixal
> languages.
>
> Brian Tawney
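
For anyone who wants to try the same kind of count on another text, here is a minimal Python sketch of the measure as I read it from the quoted message: split on whitespace, count distinct tokens, and report what share of the distinct tokens occur exactly once. The function name hapax_share and the sample string are hypothetical, and taking the percentage over distinct words (rather than over all running tokens) is an assumption about how the quoted figures were computed.

```python
from collections import Counter

def hapax_share(text):
    """Return (distinct_words, percent of distinct words that occur only once)."""
    counts = Counter(text.split())      # a "word" is a space-delimited token
    distinct = len(counts)
    if distinct == 0:
        return 0, 0.0                   # empty text: nothing to count
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return distinct, 100.0 * hapaxes / distinct

# Toy input only, not real corpus data: six tokens drawn from one Latin paradigm.
sample = "stella stellae stellam stella stellarum stellis"
distinct, percent = hapax_share(sample)
print(f"{distinct} distinct words, of which {percent:.2f}% occur only once")
```

No stemming or affix splitting is done, which matches the space-delimited treatment described in the quoted message; splitting off affixes or lemmatizing first would lower the share of words that occur once.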