Re: VMs: provable?
> [Marke Fincher:] over 6000 of the 8700 VMs
> words occur only once in the whole manuscript.
> [Rene:] Is that number correct?
> [Rene:] How would it be for an English / Latin / German
> text of the same size?
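Here is a quick tabulation for my collection of sample texts, most of
them truncated to match the number of tokens in the VMS. In the table,
"words" is the number of distinct words, "unique" is the number of
words that occur only once, and "%uniq" is unique/words as a percentage: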
sample    tokens   words  unique  %uniq Source (truncated to VMS-size)
-------- ------- ------- ------- ------ ------------------------------
engl/wow   35027    4869    2465  50.63 Wells's War of the Worlds
engl/wnm     831     194     100  51.55 Proper nouns from WotW
engl/cul   35027    3739    1784  47.71 Culpeper's Herbal
engl/cpn     541     400     322  80.50 Plant names from Culpeper's
engl/twp   35027    4202    2225  52.95 Towneley Plays
latn/ptt   35027    6652    3842  57.76 Vulgate Pentateuch
latn/nwt   35027    5740    2947  51.34 Vulgate Gospels
latn/ock   35027    5589    2926  52.35 Ockham's Dialogus
grek/nwt   35027    5437    2825  51.96 Byzantine Gospels
span/qvi   70054    8772    5065  57.74 Don Qvixote (old spelling)
ital/psp   35027    6623    4085  61.68 Manzoni's Promessi Sposi
fran/tal   35027    6223    3698  59.42 Verne's Terre à la Lune
port/csm   35027    6283    3785  60.24 Assis's Dom Casmurro
germ/sim   35027    6826    4223  61.87 Simplicissimus Teutsch
russ/pic   35027    9761    6659  68.22 Roadside Picnic (phon.)
russ/ptt   35027    5520    2910  52.72 Synodal Pentateuch (cyr.)
arab/quf   35027   10935    7353  67.24 Quran (fully marked)
arab/quv   35027   10762    7187  66.78 Quran (vowels, no sukuns)
arab/qud   35027    8531    5245  61.48 Quran (consonants only)
arab/qph   35027    9434    6044  64.07 Quran (ditto, phonetic)
arab/qcs   35027    9025    5649  62.59 Quran (ditto, alt. file)
hebr/tav   35027   12640    8548  67.63 Torah (vowels)
hebr/tad   35027   11856    7842  66.14 Torah (cons. only)
geez/gok   34291   12272    8344  67.99 Glory of the Kings (SERA)
geez/eno   17736    6274    4193  66.83 1 Enoch (SERA)
viet/ptt   35017    1631     397  24.34 Cadman Pentateuch (VIQR)
viet/nwt   35027    2010     569  28.31 Cath. New Gospels (VIQR)
chin/ptt   35027    1392     280  20.11 Union Pentateuch (GB)
chin/ptn   35027    1405     291  20.71 New Pentateuch (GB)
chin/red   35027    2420     663  27.40 Dream of Red Mansion (GB)
chin/voa   35027    1616     348  21.53 Voice of America (GB)
chip/voa   35027     830      98  11.81 Voice of America (pinyin)
tibe/vim   35027    1300     370  28.46 Vimalakirti Sutra (ACIP)
tibe/ccv   35027     846     196  23.17 Commentary on CVR (ACIP)
tibe/pmi   35027    1963     515  26.24 Mistaken Illusion (ACIP)
chrc/red   35027    2420     663  27.40 DoRM - Roman codebook
enrc/wow   35027    4869    2465  50.63 WotW - Roman codebook
envt/wow   35027    2242     475  21.19 WotW - Viet-substituted
envg/wow   35027   12911    9127  70.69 WotW - Vigenere coded
voyp/grs    1950     635     365  57.48 Rugg's pseudo-VMS (sfw)
voyp/grm     708     307     204  66.45 Rugg's pseudo-VMS (man)
viep/grs   31200    7760    3216  41.44 Rugg style pseudo-Vietnamese
viep/mky   35027    3341    1174  35.14 Monkey pseudo-Vietnamese
I have excluded numbers, punctuation, unreadable words, etc.
Upper case letters were mapped to lower case.
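For anyone who wants to reproduce counts of this kind, here is a minimal
sketch. It is not the script used for the table: the function name
hapax_stats and the regex tokenizer (runs of letters, with internal
apostrophes allowed) are assumptions, since the exact tokenization rules
are not spelled out beyond the exclusions listed above.

```python
import re
from collections import Counter

def hapax_stats(text, limit=None):
    """Return (tokens, words, unique, pct_uniq) for a text.

    tokens   = number of word tokens kept
    words    = number of distinct words
    unique   = number of words occurring exactly once
    pct_uniq = 100 * unique / words
    """
    # Assumption: a token is a run of letters, optionally joined by
    # apostrophes; digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    if limit is not None:
        tokens = tokens[:limit]  # truncate to a fixed sample size
    counts = Counter(tokens)
    words = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    pct_uniq = 100.0 * unique / words if words else 0.0
    return len(tokens), words, unique, pct_uniq
```

Calling hapax_stats(text, limit=35027) would truncate a sample to the
same token count used for most rows of the table.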
Note that Russian, German, Romance and Afroasiatic ("Semitic")
languages indeed tend to have larger vocabularies (6000--12000 words
in 35000 tokens) and a higher percentage of unique words (57% to 68%),
whereas these numbers are lower for English (under 5000 and ~50%,
respectively). However, inflections cannot be the only reason for the
unique word statistic, since this numerical gap is spanned by
different books in the same language. Compare russ/pic (Russian novel)
with russ/ptt (Russian bible), or latn/ptt (Vulgate Pentateuch) with
latn/nwt (Vulgate Gospels).
East Asian samples have very few unique words, especially chip/voa
which is in phonetic encoding (pinyin) and hence does not distinguish
homonyms. It should be noted, however, that in all these samples
spelling was very consistent and each syllable was treated as a single
word (i.e. compound words were split and prefixes/suffixes were detached).
Not surprisingly, sample envt/wow (English text where each distinct
word was replaced by a distinct combination of one or two Vietnamese
words) has the same low counts as Vietnamese; whereas envg/wow (the
same English text encrypted with a Vigenère cipher) holds the record
for the percentage of unique words (70%).
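The contrast between a word-for-word codebook and a Vigenère cipher can
be illustrated with a toy sketch. The assumptions here are mine: the key
"lime" and the sample sentence are arbitrary, and the key is taken to run
continuously over the letters without restarting at each word, which may
or may not match how envg/wow was produced.

```python
from collections import Counter
from itertools import cycle

def vigenere(text, key):
    """Encrypt a-z letters with a running Vigenere key; keep other chars."""
    shifts = cycle(ord(k) - ord("a") for k in key)
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + next(shifts)) % 26 + ord("a")))
        else:
            out.append(ch)  # spaces pass through, so word breaks survive
    return "".join(out)

def vocab_and_hapax(tokens):
    """Return (distinct words, words occurring exactly once)."""
    counts = Counter(tokens)
    return len(counts), sum(1 for c in counts.values() if c == 1)

plain = "the cat and the dog and the bird".split()

# Codebook: each distinct word maps to its own code word, so the
# frequency profile is carried over unchanged.
codebook = {w: f"w{i}" for i, w in enumerate(dict.fromkeys(plain))}
coded = [codebook[w] for w in plain]

# Vigenere: repeated words usually meet the key at different offsets
# and so encrypt to different strings.
cipher = vigenere(" ".join(plain), "lime").split()

plain_stats = vocab_and_hapax(plain)
coded_stats = vocab_and_hapax(coded)
cipher_stats = vocab_and_hapax(cipher)
```

On this toy sentence the codebook version keeps the same counts as the
plain text, while every token of the Vigenère version comes out as a
distinct word.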
All the best,
--stolfi