RE: VMs: split words

To: vms-list@xxxxxxxxxxx
Subject: RE: VMs: split words
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Fri, 10 Sep 2004 23:06:21 -0300
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx
Marke, your statistics on repeated sequences are quite intriguing.
I still haven't had the time to understand what they mean, but 
that sort of thing is certainly something that we should explore.

In the VMS, indeed, spaces tend to occur in certain contexts more
often than in others. That is partly a feature of the language: just
as in English a break is more likely to occur before "th" than between
a "t" and an "h", in Voynichese it is more likely to occur after a "y"
or "n", or before an "o", etc., than between an "a" and an "i". This
preference of word breaks for certain contexts is in fact much
stronger for the VMS than for English, and is another hint that the
VMS is not random garbage.
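
One rough way to quantify that preference, for any word list, is to ask
what fraction of the occurrences of each letter pair straddle a word
break. The Python sketch below is only an illustration under my own
assumptions: the function name and the toy English sample are
hypothetical, and each character is treated as one letter, which is not
quite right for EVA or for the ad-hoc encodings of other scripts.

```python
from collections import Counter

def break_preference(words):
    """Map each letter pair (a, b) to the fraction of its occurrences
    that straddle a word break rather than fall inside a word."""
    within = Counter()
    across = Counter()
    # Pairs of adjacent letters inside a single word.
    for w in words:
        for a, b in zip(w, w[1:]):
            within[(a, b)] += 1
    # Pairs formed by the last letter of a word and the first of the next.
    for w1, w2 in zip(words, words[1:]):
        if w1 and w2:
            across[(w1[-1], w2[0])] += 1
    pairs = set(within) | set(across)
    return {p: across[p] / (across[p] + within[p]) for p in pairs}

# Toy sample (hypothetical): "t" + "h" never straddles a break here.
sample = "the cat sat on the other mat".split()
pref = break_preference(sample)
print(pref[("t", "h")], pref[("t", "s")])
```

A text whose breaks strongly prefer certain contexts should show most
of these fractions close to 0 or close to 1, with few in between.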

However, beware that spaces are quite uncertain. Just because all
transcribers agree on "." rather than "," or "", it does not follow
that the space is really a word break. In any language, character
spacing in handwritten block-print text is quite variable; we normally
do not notice the variation because the content allows us to
distinguish true breaks from accidents in most cases.

In the VMS, spurious extra spaces may occur in certain contexts for
reasons related to the shape of the letters and/or the pen stroke
dynamics. For instance EVA "a" and "r" are usually close to each
other, presumably because of the two consecutive "i" strokes.
Conversely, there is often a wider space after "r" and "s", presumably
because the scribe instinctively tries to stay clear of their plumes
when writing the next glyph. That may be the reason why certain
prefixes like "ar", "al" etc. often occur as separate words in the
transcribed files.

On the other hand, the contexts where line breaks occur (which
presumably are real word boundaries) are quite similar to the contexts
of word spaces, from which we may perhaps conclude that most of the
"word spaces" are indeed word breaks, too.
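
That comparison could be sketched as follows in Python, assuming the
text is available as a list of lines, each a list of non-empty words.
The function names, and the choice of total variation distance as the
similarity measure, are my own assumptions and not necessarily how the
observation above was obtained. Note that this naive version also
counts paragraph ends as ordinary line breaks.

```python
from collections import Counter

def normalize(counter):
    """Turn raw counts into a probability distribution."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}

def context_distributions(lines):
    """Return (space_dist, linebreak_dist): distributions of the
    (last letter, first letter) context at within-line word spaces
    and at line breaks, respectively."""
    space = Counter()
    linebreak = Counter()
    for words in lines:
        for w1, w2 in zip(words, words[1:]):
            space[(w1[-1], w2[0])] += 1
    nonempty = [ws for ws in lines if ws]
    for l1, l2 in zip(nonempty, nonempty[1:]):
        linebreak[(l1[-1][-1], l2[0][0])] += 1
    return normalize(space), normalize(linebreak)

def total_variation(p, q):
    """Total variation distance: 0 for identical distributions,
    1 for distributions with no context in common."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A small distance between the two distributions would support the
conclusion that most transcribed spaces behave like real word breaks.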

There is some tantalizing evidence suggesting that Voynichese may be
transliterated Arabic. That theory could provide not one, but two
explanations for those detachable prefixes. First: in Arabic writing,
some letter pairs within a word are always joined, while some are
always separated by a gap, and a few are often replaced by standard
ligatures. So perhaps the VMS author misinterpreted some of the
intra-word gaps for word breaks. Second: When transliterating Arabic
into Western alphabets, some word prefixes are sometimes attached,
sometimes detached: "alf laila wa laila", "wa-laila" or "walaila",
"Al Qaeda", "Aldebaran", etc.

Actually this last feature may occur in European languages as well,
especially before they had their spellings standardized. Thus the
Spanish articles "el" and "la" are now generally written as separate
words, but at one point they were often joined to the next word, a
usage that still survives in some names like "elRey" and "Lamarca".
The same probably can be said of French, Italian, and Portuguese.
(Reminds me of Gulliver's Kingdom of Laputa, which is prudently
renamed "Labuta" in Portuguese translations...)

  > [Marke:] If anyone wants me to repeat the experiment with a
  > particular language then please send me a sample

You may wish to try with some of the following files.

In "http://www.ic.unicamp.br/~stolfi/voynich/Notes/101/dat/":

  arab/qcs/tot.1/gud.tlw  Arabic without vowels (The Holy Quran)
  arab/quv/tot.1/gud.tlw  Arabic with vowels (The Holy Quran)
  chip/voa/tot.1/gud.tlw  Mandarin Pinyin with tones (Voice of America)
  engl/twp/tot.1/gud.tlw  1400's Middle English (Towneley Plays)
  engl/cul/her.1/gud.tlw  1600's English (Culpeper's Herbal) [*]
  engl/wow/tot.1/gud.tlw  1800's English (Wells's War of the Worlds)
  enrc/wow/tot.1/gud.tlw  English in Roman codebook cipher (ditto)
  envg/wow/tot.1/gud.tlw  English in Vigenère cipher (ditto)
  envt/wow/tot.1/gud.tlw  English word-subst by Vietnamese (ditto)
  fran/tal/tot.1/gud.tlw  1800's French (Verne's De la terre à la lune)
  geez/gok/tot.1/gud.tlw  Ethiopian Ge'ez (Glory of The Kings) [*]
  germ/sim/tot.1/gud.tlw  1600's German (Abenteuer Simplicius Simplicissimus)
  grek/nwt/tot.1/gud.tlw  200's Greek (Byzantine Gospels)
  hebr/tav/tot.1/gud.tlw  Hebrew with vowel marks (Pentateuch)
  hebr/tad/tot.1/gud.tlw  Hebrew without vowel marks (Pentateuch)
  ital/psp/tot.1/gud.tlw  1800's Italian (Manzoni's Promessi Sposi)
  latn/nwt/tot.1/gud.tlw  300's Latin (Vulgate Gospels)
  latn/ptt/tot.1/gud.tlw  300's Latin (Vulgate Pentateuch)
  latn/ock/tot.1/gud.tlw  1300's Latin (Ockam's Dialogus)
  port/csm/tot.1/gud.tlw  Portuguese (Machado de Assis's Dom Casmurro)
  russ/pic/tot.1/gud.tlw  Russian, transliterated (Strugatskys' Roadside Picnic)
  russ/ptt/tot.1/gud.tlw  Church Russian, KOI8-R (Synodal Pentateuch)
  span/qvi/one.1/gud.tlw  1600's Spanish (Cervantes's Don Quixote, Part I)
  span/qvi/two.1/gud.tlw  1600's Spanish (Cervantes's Don Quixote, Part II)
  tibe/pmi/tot.1/gud.tlw  Tibetan (Kyabje Trijang Rinpoche's Mistaken Illusion)
  viep/grs/tot.1/gud.tlw  Pseudo-Vietnamese (by Gordon Rugg's method)
  viep/mky/tot.1/gud.tlw  Pseudo-Vietnamese (by 3rd-order Markov chain)
  viet/ptt/tot.1/gud.tlw  Vietnamese (Cadman Pentateuch)
  voyn/prs/tot.1/gud.tlw  Voynichese (Majority version, prose text only)

These files should be reasonably clean (without numbers,
foreign-language quotations, chapter/section titles, etc.). They were
all trimmed to the same length as the VMS prose-only text (35027
words), except those marked "[*]": engl/cul/her.1 has only 25177
words, geez/gok/tot.1 has 34291. In some cases the trimming was not
continuous; in the New Testament files, for example, I took one chunk
from the beginning of each Gospel, proportional to its length, to make
up the 35027 words.

The format is one token per line, as "TAG LOC WORD" where TAG is either
"#" (comment, to be ignored) or "a" (word), and LOC tells the WORD's
position (chapter, section, verse, etc.) in the original book.
Each file is about 1 MB long.
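
A minimal reader for that format might look like the Python sketch
below. It assumes that the three fields are separated by whitespace,
that LOC itself contains no whitespace, and that reading the file as
Latin-1 is good enough to keep every byte intact; none of this is
guaranteed by the description above. The function names and the sample
lines are hypothetical.

```python
def parse_tlw(lines):
    """Return a list of (loc, word) pairs from "TAG LOC WORD" lines.
    Lines whose TAG is not "a" (e.g. "#" comments) are skipped."""
    tokens = []
    for line in lines:
        fields = line.split(None, 2)
        # Skip blank lines, comments, and anything without three fields.
        if len(fields) < 3 or fields[0] != "a":
            continue
        tokens.append((fields[1], fields[2].strip()))
    return tokens

def read_tlw(path):
    """Read a file from disk; Latin-1 maps every byte to one character,
    so the ad-hoc encodings pass through undamaged (assumption)."""
    with open(path, encoding="latin-1") as f:
        return parse_tlw(f)

# Hypothetical sample lines, not copied from the actual files.
sample = [
    "# 1.1 this comment line is ignored",
    "a 1.1 in",
    "a 1.1 principio",
    "",
]
print(parse_tlw(sample))
```

The word list alone is then just the second item of each pair, e.g.
[w for _, w in parse_tlw(sample)].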

All words were mapped to lowercase, and there is no punctuation, not
even paragraph delimiters. Beware that most of these files use
non-ASCII (but ISO printable) characters, in various ad-hoc encodings.

Enjoy...

--stolfi

PS. If you need more details on those files (bibliographic data,
encoding, cleanup details), just ask -- I have that info written 
up somewhere.
