RE: VMs: split words
Marke, your statistics on repeated sequences are quite intriguing.
I still haven't had the time to understand what they mean, but
that sort of thing is certainly something that we should explore.
In the VMS, indeed, spaces tend to occur in certain contexts more
often than in others. That is partly a feature of the language: just
as in English a break is more likely to occur before "th" than between
a "t" and an "h", in Voynichese it is more likely to occur after a "y"
or "n", or before an "o", etc., than between an "a" and an "i". This
preference of word breaks for certain contexts is in fact much
stronger for the VMS than for English, and is another hint that the
VMS is not random garbage.
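A minimal sketch of how one might measure this preference, assuming the text is already split into words: for each letter, compute the fraction of its occurrences that fall at the end of a word. The function name and the toy English sample are illustrative only; nothing here is taken from Marke's experiment.

```python
from collections import Counter

def break_context_rates(words):
    """Map each letter to the fraction of its occurrences that are word-final.

    `words` is any iterable of strings; empty strings are skipped.
    """
    total = Counter()   # all occurrences of each letter
    final = Counter()   # occurrences as the last letter of a word
    for w in words:
        if not w:
            continue
        total.update(w)
        final[w[-1]] += 1
    return {c: final[c] / total[c] for c in total}

# Toy English sample (hypothetical, for illustration only).
sample = "the thin man then hit the thing".split()
rates = break_context_rates(sample)
for letter in sorted(rates, key=rates.get, reverse=True):
    print(letter, round(rates[letter], 2))
```

A rate near 1 means the letter is almost always followed by a break, and a rate near 0 means it almost never is. The same count on first letters would give the "before" contexts.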
However, beware that spaces are quite uncertain. Just because all
transcribers agree on "." rather than "," or "", it does not follow
that the space is really a word break. In any language, character
spacing in handwritten block-print text is quite variable; we normally
do not notice the variation because the content allows us to
distinguish true breaks from accidents in most cases.
In the VMS, spurious extra spaces may occur in certain contexts for
reasons related to the shape of the letters and/or the pen stroke
dynamics. For instance EVA "a" and "r" are usually close to each
other, presumably because of the two consecutive "i" strokes.
Conversely, there is often a wider space after "r" and "s", presumably
because the scribe instinctively tries to stay clear of their plumes
when writing the next glyph. That may be the reason why certain
prefixes like "ar", "al" etc. often occur as separate words in the
On the other hand, the contexts where line breaks occur (which
presumably are real word boundaries) are quite similar to the contexts
of word spaces, from which we may perhaps conclude that most of the
"word spaces" are indeed word breaks, too.
There is some tantalizing evidence suggesting that Voynichese may be
transliterated Arabic. That theory could provide not one, but two
explanations for those detachable prefixes. First: in Arabic writing,
some letter pairs within a word are always joined, while some are
always separated by a gap, and a few are often replaced by standard
ligatures. So perhaps the VMS author mistook some of the
intra-word gaps for word breaks. Second: when transliterating Arabic
into Western alphabets, some word prefixes are sometimes attached,
sometimes detached: "alf laila wa laila", "wa-laila" or "walaila",
"Al Qaeda", "Aldebaran", etc.
Actually this last feature may occur in European languages as well,
especially before they had their spellings standardized. Thus the
Spanish articles "el" and "la" are now generally written as separate
words, but at one point they were often joined to the next word, a
usage that still survives in some names like "elRey" and "Lamarca".
The same probably can be said of French, Italian, and Portuguese.
(Reminds me of Gulliver's Kingdom of Laputa, which is prudently
renamed "Labuta" in Portuguese translations...)
> [Marke:] If anyone wants me to repeat the experiment with a
> particular language then please send me a sample
You may wish to try with some of the following files.
arab/qcs/tot.1/gud.tlw Arabic without vowels (The Holy Quran)
arab/quv/tot.1/gud.tlw Arabic with vowels (The Holy Quran)
chip/voa/tot.1/gud.tlw Mandarin Pinyin with tones (Voice of America)
engl/twp/tot.1/gud.tlw 1400's Middle English (Towneley Plays)
engl/cul/her.1/gud.tlw 1600's English (Culpeper's Herbal) [*]
engl/wow/tot.1/gud.tlw 1800's English (Wells's War of the Worlds)
enrc/wow/tot.1/gud.tlw English in Roman codebook cipher (ditto)
envg/wow/tot.1/gud.tlw English in Vigenère cipher (ditto)
envt/wow/tot.1/gud.tlw English word-subst by Vietnamese (ditto)
fran/tal/tot.1/gud.tlw 1800's French (Verne's De la terre à la lune)
geez/gok/tot.1/gud.tlw Ethiopian Ge'ez (Glory of The Kings) [*]
germ/sim/tot.1/gud.tlw 1600's German (Abenteuer Simplicius Simplicissimus)
grek/nwt/tot.1/gud.tlw 200's Greek (Byzantine Gospels)
hebr/tav/tot.1/gud.tlw Hebrew with vowel marks (Pentateuch)
hebr/tad/tot.1/gud.tlw Hebrew without vowel marks (Pentateuch)
ital/psp/tot.1/gud.tlw 1800's Italian (Manzoni's Promessi Sposi)
latn/nwt/tot.1/gud.tlw 300's Latin (Vulgate Gospels)
latn/ptt/tot.1/gud.tlw 300's Latin (Vulgate Pentateuch)
latn/ock/tot.1/gud.tlw 1300's Latin (Ockham's Dialogus)
port/csm/tot.1/gud.tlw Portuguese (Machado de Assis's Dom Casmurro)
russ/pic/tot.1/gud.tlw Russian, transliterated (Strugatskys' Roadside Picnic)
russ/ptt/tot.1/gud.tlw Church Russian, KOI8-R (Synodal Pentateuch)
span/qvi/one.1/gud.tlw 1600's Spanish (Cervantes's Don Quixote, Part I)
span/qvi/two.1/gud.tlw 1600's Spanish (Cervantes's Don Quixote, Part II)
tibe/pmi/tot.1/gud.tlw Tibetan (Kyabje Trijang Rinpoche's Mistaken Illusion)
viep/grs/tot.1/gud.tlw Pseudo-Vietnamese (by Gordon Rugg's method)
viep/mky/tot.1/gud.tlw Pseudo-Vietnamese (by 3rd-order Markov chain)
viet/ptt/tot.1/gud.tlw Vietnamese (Cadman Pentateuch)
voyn/prs/tot.1/gud.tlw Voynichese (Majority version, prose text only)
These files should be reasonably clean (without numbers,
foreign-language quotations, chapter/section titles, etc.). They were
all trimmed to the same length as the VMS prose-only text (35027
words), except those marked "[*]": engl/cul/her.1 has only 25177
words, geez/gok/tot.1 has 34291. In some cases the trimming was not
continuous; in the New Testament files, for example, I took one chunk
from the beginning of each Gospel, proportional to its length, to make
up the 35027 words.
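The proportional trimming can be sketched as follows. This is only an illustration under stated assumptions: the rounding rule (largest remainder) and the per-book word counts are hypothetical placeholders, not the procedure or the figures actually used for these files.

```python
def proportional_chunks(lengths, target):
    """Split `target` words among sources in proportion to their lengths.

    Uses largest-remainder rounding so the chunk sizes sum exactly to `target`.
    """
    total = sum(lengths.values())
    if target > total:
        raise ValueError("target exceeds the total number of words available")
    exact = {k: target * n / total for k, n in lengths.items()}
    sizes = {k: int(x) for k, x in exact.items()}
    leftover = target - sum(sizes.values())
    # Hand the remaining words to the sources with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - sizes[k], reverse=True)
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes

# Hypothetical word counts for four books; NOT the real Gospel lengths.
book_lengths = {"book1": 18000, "book2": 11000, "book3": 19000, "book4": 15000}
chunks = proportional_chunks(book_lengths, 35027)
print(chunks, sum(chunks.values()))
```

Each chunk would then be the first `chunks[name]` words of the corresponding book.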
The format is one token per line, as "TAG LOC WORD" where TAG is either
"#" (comment, to be ignored) or "a" (word), and LOC tells the WORD's
position (chapter, section, verse, etc.) in the original book.
Each file is about 1 MB long.
All words were mapped to lowercase, and there is no punctuation, not
even paragraph delimiters. Beware that most of these files use
non-ASCII (but ISO printable) characters, in various ad-hoc encodings.
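A minimal reader for this format might look like the sketch below. It assumes that the three fields are separated by whitespace and that LOC contains no embedded blanks; neither is guaranteed by the description above. The function names, the default encoding, and the sample LOC values are hypothetical.

```python
def parse_tlw_lines(lines):
    """Yield (loc, word) pairs from "TAG LOC WORD" lines, skipping comments."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or "#" comment
        fields = stripped.split()
        # Assumption: a word line is exactly "a", LOC, WORD.
        if fields[0] == "a" and len(fields) == 3:
            yield fields[1], fields[2]

def read_tlw(path, encoding="latin-1"):
    """Read a .tlw file; the encoding is a guess and may differ per file."""
    with open(path, encoding=encoding) as f:
        return list(parse_tlw_lines(f))

# Hypothetical sample lines; the real LOC syntax may differ.
demo = ["# header comment", "a 1.1 in", "a 1.1 principio", ""]
print(list(parse_tlw_lines(demo)))
```

Latin-1 is used as the default only because it decodes every byte value without error, so the ad-hoc encodings pass through unchanged. A file such as russ/ptt, which is in KOI8-R, would need `encoding="koi8_r"` to display correctly.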
PS. If you need more details on those files (bibliographic data,
encoding, cleanup details), just ask -- I have that info written