Re: VMs: provable?

To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: provable?
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Wed, 29 Sep 2004 23:49:00 -0300
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx

  > [Marke Fincher:] over 6000 of the 8700 VMs
  > words occur only once in the whole manuscript.

  > [Rene:] Is that number correct?

  > [Rene:] How would it be for an English / Latin / German
  > text of the same size?

Here is a quick tabulation for my collection of sample texts, most of 
them truncated to match the number of tokens in the VMS:

  sample    tokens   words  unique  %uniq Source (truncated to VMS-size)
  -------- ------- ------- ------- ------ ------------------------------
  engl/wow   35027    4869    2465  50.63 Wells's War of the Worlds
  engl/wnm     831     194     100  51.55 Proper nouns from WotW
  engl/cul   35027    3739    1784  47.71 Culpeper's Herbal
  engl/cpn     541     400     322  80.50 Plant names from Culpeper's
  engl/twp   35027    4202    2225  52.95 Towneley Plays
  latn/ptt   35027    6652    3842  57.76 Vulgate Pentateuch
  latn/nwt   35027    5740    2947  51.34 Vulgate Gospels
  latn/ock   35027    5589    2926  52.35 Ockham's Dialogus
  grek/nwt   35027    5437    2825  51.96 Byzantine Gospels
  span/qvi   70054    8772    5065  57.74 Don Qvixote (old spelling)
  ital/psp   35027    6623    4085  61.68 Manzoni's Promessi Sposi
  fran/tal   35027    6223    3698  59.42 Verne's Terre à la Lune
  port/csm   35027    6283    3785  60.24 Assis's Dom Casmurro
  germ/sim   35027    6826    4223  61.87 Simplicissimus Teutsch
  russ/pic   35027    9761    6659  68.22 Roadside Picnic (phon.)
  russ/ptt   35027    5520    2910  52.72 Synodal Pentateuch (cyr.)
  arab/quf   35027   10935    7353  67.24 Quran (fully marked)
  arab/quv   35027   10762    7187  66.78 Quran (vowels, no sukuns)
  arab/qud   35027    8531    5245  61.48 Quran (consonants only)
  arab/qph   35027    9434    6044  64.07 Quran (ditto, phonetic)
  arab/qcs   35027    9025    5649  62.59 Quran (ditto, alt. file)
  hebr/tav   35027   12640    8548  67.63 Torah (vowels)
  hebr/tad   35027   11856    7842  66.14 Torah (cons. only)
  geez/gok   34291   12272    8344  67.99 Glory of the Kings (SERA)
  geez/eno   17736    6274    4193  66.83 1 Enoch (SERA)
  viet/ptt   35017    1631     397  24.34 Cadman Pentateuch (VIQR)
  viet/nwt   35027    2010     569  28.31 Cath. New Gospels (VIQR)
  chin/ptt   35027    1392     280  20.11 Union Pentateuch (GB)
  chin/ptn   35027    1405     291  20.71 New Pentateuch (GB)
  chin/red   35027    2420     663  27.40 Dream of Red Mansion (GB)
  chin/voa   35027    1616     348  21.53 Voice of America (GB)
  chip/voa   35027     830      98  11.81 Voice of America (pinyin)
  tibe/vim   35027    1300     370  28.46 Vimalakirti Sutra (ACIP)
  tibe/ccv   35027     846     196  23.17 Commentary on CVR (ACIP)
  tibe/pmi   35027    1963     515  26.24 Mistaken Illusion (ACIP)
  chrc/red   35027    2420     663  27.40 DoRM - Roman codebook
  enrc/wow   35027    4869    2465  50.63 WotW - Roman codebook 
  envt/wow   35027    2242     475  21.19 WotW - Viet-substituted
  envg/wow   35027   12911    9127  70.69 WotW - Vigenere coded
  voyp/grs    1950     635     365  57.48 Rugg's pseudo-VMS (sfw)
  voyp/grm     708     307     204  66.45 Rugg's pseudo-VMS (man)
  viep/grs   31200    7760    3216  41.44 Rugg style pseudo-Vietnamese
  viep/mky   35027    3341    1174  35.14 Monkey pseudo-Vietnamese

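In the table, "words" is the number of distinct words, "unique" is the
number of words that occur only once, and "%uniq" is unique/words.
I have excluded numbers, punctuation, unreadable words, etc. 
Upper case letters were mapped to lower case.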
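
For anyone who wants to repeat the count on their own samples, here is a
minimal sketch in Python. It is not the script used for the table above:
the function name hapax_stats, the letters-only tokenizer, and the
max_tokens parameter are all assumptions made for illustration.

```python
import re
from collections import Counter

def hapax_stats(text, max_tokens=None):
    """Return (tokens, words, unique, pct_unique) for a text sample.

    Assumed tokenization: lower-case everything, then treat every run of
    letters as a token, so digits and punctuation act as separators.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if max_tokens is not None:
        # Truncate the sample to a fixed number of tokens (e.g. 35027).
        tokens = tokens[:max_tokens]
    counts = Counter(tokens)
    words = len(counts)                                 # distinct words
    unique = sum(1 for c in counts.values() if c == 1)  # words seen once
    pct = 100.0 * unique / words if words else 0.0
    return len(tokens), words, unique, pct

if __name__ == "__main__":
    print(hapax_stats("The cat and the dog. The END, 42!"))
```

Note that this tokenizer splits on apostrophes and hyphens, which may or
may not match how a given sample file was prepared.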

Note that Russian, German, Romance and Afroasiatic ("Semitic")
languages indeed tend to have larger vocabularies (6000--12000 words
in 35000 tokens) and a higher percentage of unique words (57% to 68%),
whereas these numbers are lower for English (under 5000 and ~50%,
respectively). However, inflections cannot be the only reason for the
unique word statistic, since this numerical gap is spanned by
different books in the same language. Compare russ/pic (Russian novel)
with russ/ptt (Russian bible), or latn/ptt (Vulgate Pentateuch) with
latn/nwt (Vulgate Gospels).

East Asian samples have very few unique words, especially chip/voa
which is in phonetic encoding (pinyin) and hence does not distinguish
homonyms. It should be noted however that in all these samples
spelling was very consistent and each syllable was treated as a single
word (i.e. compound words were split and prefixes/suffixes were detached).

Not surprisingly, sample envt/wow (English text where each distinct
word was replaced by a distinct combination of one or two Vietnamese
words) has the same low counts as Vietnamese; whereas envg/wow (the
same English text encrypted with a Vigenère cipher) holds the record
for the percentage of unique words (70.69%).
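
One plausible reason for the Vigenère result can be sketched as follows,
assuming the cipher keeps the word spaces and the key position runs on
across word boundaries (an assumption, since the details of how envg/wow
was prepared are not given here). Under that assumption the same plaintext
word is enciphered differently depending on where it falls relative to the
key, so a repeated word splits into several distinct ciphertext words. The
function name vigenere_words and the key "lemon" are hypothetical.

```python
def vigenere_words(tokens, key):
    """Encipher lower-case a-z tokens with a Vigenere key.

    Assumption: the key index advances by one per letter and is NOT reset
    at word boundaries.
    """
    out = []
    i = 0  # running position in the key stream
    for tok in tokens:
        enc = []
        for ch in tok:
            shift = ord(key[i % len(key)]) - ord("a")
            enc.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1
        out.append("".join(enc))
    return out

if __name__ == "__main__":
    plain = ["the", "cat", "the", "dog", "the"]
    coded = vigenere_words(plain, "lemon")
    # Compare the number of distinct words before and after enciphering.
    print(len(set(plain)), len(set(coded)))
```

In this toy input the three occurrences of "the" fall at three different
key offsets, so each one becomes a different ciphertext word.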

All the best,

--stolfi