Re: 8200 Voynichese Words
- To: Dennis <ixohoxi@xxxxxxxxxxxxx>
- Subject: Re: 8200 Voynichese Words
- From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
- Date: Sat, 9 Dec 2000 22:34:44 -0200 (EDT)
- Cc: voynich@xxxxxxxx
- Delivered-to: reeds@research.att.com
- In-reply-to: <3A27AC06.59D6D8C0@micro-net.com>
- References: <BPEOIKLPOIDECCHIOEMCGEEGCAAA.Claus_Anders@t-online.de> <39D02EA1.6011D2B3@mail.msen.com> <39D09ABF.DF482B07@gte.net> <39D0B94D.55710F91@alphalink.com.au> <200009270209.e8R29xC02004@coruja.dcc.unicamp.br> <39D19E14.7273E51C@alphalink.com.au> <200010090043.e990hVB29746@coruja.dcc.unicamp.br> <39E27A1C.E18F190@micro-net.com> <3A27AC06.59D6D8C0@micro-net.com>
- Reply-to: stolfi@xxxxxxxxxxxxx
- Sender: jim@xxxxxxxxxxxxx
> Hello, Jorge! I need your help. I'm still trying to
> make sense of this:
>
> > I'm still trying to make sense of the 8200 words in
> > the VMs. Robert Firth's paradigm accounts for ~80% of
> > the tokens. This 80% is about 280 words. But that
> > leaves ~7900 words that the paradigm doesn't account
> > for.
> > Also - in Hamptonese I saw words that were obviously
> > the same but the spelling differed. Could we be seeing
> > this in the VMs?
>
> Suppose the 8200-word dictionary was caused by:
>
> 1) A lax orthography, as most orthographies of the time
> were, and
>
> > 2) most tokens being one syllable, while some are two or
> > three syllables, not necessarily from the same word.
>
> Do you have any info that would shed light on these
> possibilities?
Well, those are basically the main straws the Chinese Theory is
desperately clinging to...
A while ago I posted a proposal for counting Voynichese words in a way
that would take into account those possible sources of polymorphism,
plus a few others (a rough code sketch follows the quoted list below):
> 1. Discard all tokens with "weirdos", defined as characters that
> occur fewer than, say, 5 times in the whole manuscript.
> (Those are postulated to be abbreviations, special symbols,
> or slips of the pen.)
>
> 2. Discard all line-initial words. (Those are likely to be
> "Grove-isms" --- words with a spurious prefixed letter.
> Also, there is the suspicion that line-initial <o>s
> often got written as <y>.)
>
> 3. Discard all line-final words, and words right before
> breaks caused by intruding figures. (Those are likely
> to be abbreviated, a suspicion supported by the
> apparent excess of <m> and <g> in those positions.)
>
> 4. Discard all tokens that fail the crust-mantle-core paradigm:
> i.e. that have two or more gallows, or two gallows/benches
> separated by a dealer { d l r s x j } or final { m n g }.
> (Those are postulated to be word pairs that were joined,
> accidentally or on purpose.)
>
> 5. Discard all tokens that have an embedded <y>, or an embedded
> final letter { m n g }. (Those are assumed to be joined words too.)
>
> 6. If a token has two discrepant readings, flip a coin to decide
> whether to discard it or not. (If we keep either reading, we have
> very roughly 50% chance of choosing the wrong one and counting one
> extra bogus word. If we discard the token, we have very roughly
> 50% chance of missing the only instance of that word. So by flipping a
> coin we are roughly balancing the two biases. It is left as an
> exercise to the reader to figure out the right policy when there
> are K divergent readings, or when they disagree on a word break.
> Extra points for taking into account the word distribution,
> so that if the two readings are <daiin> and <jeeeeb>, we take
> the <daiin>.)
>
> 7. Erase all occurrences of the letter <q>. (Since it does
> not occur in labels, we can assume that it is not part
> of the word -- perhaps an article, a symbol for "and", etc.)
>
> 8. Map all occurrences of <p> to <t>, and <f> to <k>.
> (Since { p f } occur almost exclusively in paragraph-initial
> lines, we can assume that they are capitalized versions of other
> letters; from their context, the latter seem to be <t> and <k>, or
> perhaps <te> and <ke>.)
>
> 9. Look for unusual letters and n-grams that are graphically similar to
> common ones, and map them to the latter: <j> -> <d>, <se> -> <sh>,
> <ith> -> <cth>, <ei> -> <a>, <ak> -> <ok>, etc. (As Bayes would
> argue, these are more likely to be slips of the pen than rare
> but legitimate combinations.)
>
> 10. Look for maximal sequences of <e>s, and handle them as follows:
> <e> - leave it alone; <ee> - map it to <ch>; <eee> - flip a coin,
> heads discard the token, tails replace by <ech> or <che> at
> random; <eeee> - discard the token. (Syntactically, <ee> behaves
> pretty much like <ch>, and therefore we can assume that it is a
> <ch> whose ligature got omitted. Under that assumption, <eee> has
> two possible parsings, and thus should be treated as a token with
> two readings, as discussed above. The group <eeee> is so rare that
> it is not worth bothering about.)
>
> 11. Delete all occurrences of { a o }. (Those are suspected of being
> tone or pitch marks, although that is admittedly only a hunch.
> Pitch/tone marks are likely to be inserted in varying places, and
> often omitted, thus giving rise to several homonyms for the same
> syllable. There are hints that <y> may belong to that class too,
> but in other ways it seems to be a final letter, in the same class
> as <aiin> etc.)
>
> 12. Split off the crust prefixes, i.e. any dealers { d l r s } that
> occur before a gallows or bench letter: <rchedy> -> <r> + <chedy>,
> <lkeedy> -> <l> + <keedy>, etc. (If the dealers are vowels
> while the gallows/benches are consonants, as assumed in the
> simplest variant of the Chinese theory, then those prefixed
> dealers must be consonant-less syllables that the author chose to
> attach to the following syllable.)
>
> The number of distinct words that remain after this filtering
> should be compared to the number of distinct syllables in
> the candidate language, tone marks omitted.
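For concreteness, here is a rough sketch, in Python, of how the discard
rules in steps 1-6 might be coded. The character classes, the helper
names, and the "initial"/"medial"/"final" position tags are illustrative
assumptions only; the 5-occurrence "weirdo" threshold and the coin-flip
policy are the ones stated above.

    import random
    from collections import Counter

    # Character classes, in EVA-like terms; these groupings are one reading
    # of the proposal above, not anything fixed by it.
    GALLOWS = set("ktpf")
    FINALS  = set("mng")

    def char_counts(tokens):
        """Count every character over the whole transcription (for step 1)."""
        return Counter(c for t in tokens for c in t)

    def has_weirdo(token, counts, threshold=5):
        """Step 1: the token contains a character seen fewer than `threshold`
        times in the whole manuscript."""
        return any(counts[c] < threshold for c in token)

    def fails_paradigm(token):
        """Step 4, crudely: two or more gallows in one token.  (The full rule
        also rejects gallows/benches separated by a dealer or final letter.)"""
        return sum(1 for c in token if c in GALLOWS) >= 2

    def has_embedded_final(token):
        """Step 5: a <y> or a final letter {m n g} in an interior position."""
        return any(c == "y" or c in FINALS for c in token[1:-1])

    def keep_token(token, position, counts):
        """Steps 1-5.  `position` is "initial", "final" or "medial", i.e. the
        token's place within its line (steps 2 and 3)."""
        if position != "medial":                                 # steps 2, 3
            return False
        if has_weirdo(token, counts):                            # step 1
            return False
        if fails_paradigm(token) or has_embedded_final(token):   # steps 4, 5
            return False
        return True

    def resolve_discrepant(readings):
        """Step 6: with divergent readings, flip a coin; on heads drop the
        token (return None), on tails keep one reading chosen at random."""
        if len(readings) == 1:
            return readings[0]
        return None if random.random() < 0.5 else random.choice(readings)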
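And a similar sketch of the rewriting steps 7-12 plus the final count.
The substitution table and the order in which the rules are applied are
my own choices for illustration; the mappings themselves and the <e>-run
policy are the ones listed above.

    import random
    import re
    from collections import Counter

    GALLOWS = set("ktpf")
    BENCHES = {"ch", "sh"}
    CRUST   = set("dlrs")

    # Step 9 look-alike repairs, taken from the list above (extend as needed).
    LOOKALIKES = [("ith", "cth"), ("se", "sh"), ("ei", "a"),
                  ("ak", "ok"), ("j", "d")]

    class Discard(Exception):
        """Raised when a rewriting rule says the whole token must be dropped."""

    E_RUN = re.compile(r"e+")

    def rewrite_e_runs(token):
        """Step 10: <e> stays, <ee> -> <ch>, <eee> -> coin flip between
        dropping the token and <ech>/<che>, <eeee> or longer -> drop."""
        def fix(m):
            run = m.group(0)
            if len(run) == 1:
                return "e"
            if len(run) == 2:
                return "ch"
            if len(run) == 3:
                if random.random() < 0.5:
                    raise Discard()
                return random.choice(["ech", "che"])
            raise Discard()
        return E_RUN.sub(fix, token)

    def normalize(token):
        """Steps 7-11 applied in order; may raise Discard."""
        token = token.replace("q", "")                        # step 7
        token = token.replace("p", "t").replace("f", "k")     # step 8
        for rare, common in LOOKALIKES:                       # step 9
            token = token.replace(rare, common)
        token = rewrite_e_runs(token)                         # step 10
        return token.replace("a", "").replace("o", "")        # step 11

    def split_crust(token):
        """Step 12: peel off leading dealers {d l r s} that sit in front of
        a gallows or bench, yielding separate word pieces."""
        pieces = []
        while len(token) > 1 and token[0] in CRUST and \
              (token[1] in GALLOWS or token[1:3] in BENCHES):
            pieces.append(token[0])
            token = token[1:]
        pieces.append(token)
        return pieces

    def count_distinct_words(tokens):
        """Apply steps 7-12 to the tokens that survived steps 1-6 and return
        the number of distinct word forms that remain."""
        words = Counter()
        for t in tokens:
            try:
                t = normalize(t)
            except Discard:
                continue
            for piece in split_crust(t):
                if piece:
                    words[piece] += 1
        return len(words)

The number returned by count_distinct_words() is what would be compared
against the tone-less syllable inventory of the candidate language.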
If no one is willing to do this analysis, I plan to do it after
the school semester is over. (Due to past strikes, that means sometime
in February 2001...).
All the best,
--stolfi