Re: 8200 Voynichese Words



    > 	Hello, Jorge!  I need your help.  I'm still trying to
    > make sense of this:
    > 
    > >         I'm still trying to make sense of the 8200 words in
    > > the VMs.  Robert Firth's paradigm accounts for ~80% of
    > > the tokens, but those tokens comprise only about 280
    > > distinct words.  That leaves ~7900 words that the
    > > paradigm doesn't account for. 
    > >         Also - in Hamptonese I saw words that were obviously
    > > the same but whose spelling differed.  Could we be seeing
    > > this in the VMs?
    > 
    > 	Suppose the 8200-word dictionary was caused by:
    > 
    > 1) A lax orthography, as most orthographies of the time
    > were, and
    > 
    > 2)  most of the tokens being one syllable, while some
    > are two or three syllables, not necessarily from the
    > same word.  
    > 
    > 	Do you have any info that would shed light on these
    > possibilities?

Well, those are basically the main straws the Chinese Theory is
desperately clinging to...

A while ago I posted a proposal for counting Voynichese words in a way
that would take into account those possible sources of polymorphism,
plus a few others:

    >    1. Discard all tokens with "weirdos", defined as characters that
    >       occur fewer than, say, 5 times in the whole manuscript.
    >       (Those are postulated to be abbreviations, special symbols,
    >       or slips of the pen.)
    > 
    >    2. Discard all line-initial words. (Those are likely to be 
    >       "Grove-isms" --- words with a spurious prefixed letter. 
    >       Also, there is the suspicion that line-initial <o>s 
    >       often got written as <y>.)
    > 
    >    3. Discard all line-final words, and words right before 
    >       breaks caused by intruding figures. (Those are likely
    >       to be abbreviated, a suspicion supported by the 
    >       apparent excess of <m> and <g> in those positions.)
    > 
    >    4. Discard all tokens that fail the crust-mantle-core paradigm:
    >       i.e. that have two or more gallows, or two gallows/benches
    >       separated by a dealer { d l r s x j } or final { m n g }.  
    >       (Those are postulated to be word pairs that were joined,
    >       accidentally or on purpose.)
    > 
    >    5. Discard all tokens that have an embedded <y>, or an embedded
    >       final letter { m n g }.  (Those are assumed to be joined words too.)
    > 
    >    6. If a token has two discrepant readings, flip a coin to decide
    >       whether to discard it or not. (If we keep either reading, we have
    >       very roughly a 50% chance of choosing the wrong one and counting one
    >       extra bogus word. If we discard the token, we have very roughly
    >       a 50% chance of missing the only instance of that word. So by flipping a
    >       coin we are roughly balancing the two biases. It is left as an
    >       exercise to the reader to figure out the right policy when there
    >       are K divergent readings, or when they disagree on a word break.
    >       Extra points for taking into account the word distribution,
    >       so that if the two readings are <daiin> and <jeeeeb>, we take
    >       the <daiin>.)
    > 
    >    7. Erase all occurrences of the letter <q>. (Since it does
    >       not occur in labels, we can assume that it is not part 
    >       of the word -- perhaps an article, a symbol for "and", etc.)
    > 
    >    8. Map all occurrences of <p> to <t>, and <f> to <k>. 
    >       (Since { p f } occur almost exclusively in paragraph-initial
    >       lines, we can assume that they are capitalized versions of other
    >       letters; from their context, the latter seem to be <t> and <k>, or
    >       perhaps <te> and <ke>.)
    > 
    >    9. Look for unusual letters and n-grams that are graphically similar to 
    >       common ones, and map them to the latter: <j> -> <d>, <se> -> <sh>, 
    >       <ith> -> <cth>, <ei> -> <a>, <ak> -> <ok>, etc. (As Bayes would 
    >       argue, these are more likely to be slips of the pen than rare
    >       but legitimate combinations.)
    > 
    >   10. Look for maximal sequences of <e>s, and handle them as follows:
    >       <e> - leave it alone; <ee> - map it to <ch>; <eee> - flip a coin,
    >       heads discard the token, tails replace by <ech> or <che> at
    >       random; <eeee> - discard the token. (Syntactically, <ee> behaves
    >       pretty much like <ch>, and therefore we can assume that it is a
    >       <ch> whose ligature got omitted. Under that assumption, <eee> has
    >       two possible parsings, and thus should be treated as a token with
    >       two readings, as discussed above. The group <eeee> is so rare that
    >       it is not worth bothering about.)
    > 
    >   11. Delete all occurrences of { a o }. (Those are suspected of being 
    >       tone or pitch marks, although that is admittedly only a hunch.
    >       Pitch/tone marks are likely to be inserted in varying places, and
    >       often omitted, thus giving rise to several homonyms for the same
    >       syllable. There are hints that <y> may belong to that class too,
    >       but in other ways it seems to be a final letter, in the same class
    >       as <aiin> etc.)
    > 
    >   12. Split off the crust prefixes, i.e. any dealers { d l r s } that 
    >       occur before a gallows or bench letter: <rchedy> -> <r> + <chedy>,
    >       <lkeedy> -> <l> + <keedy>, etc. (If the dealers are vowels
    >       while the gallows/benches are consonants, as assumed in the
    >       simplest variant of the Chinese theory, then those prefixed
    >       dealers must be consonant-less syllables that the author chose to
    >       attach to the following syllable.)
    >     
    > The number of distinct words that remain after this filtering
    > should be compared to the number of distinct syllables in 
    > the candidate language, tone marks omitted. 
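
For concreteness, here is a rough Python sketch of how steps 1-5 and
7-12 above could be mechanized. It assumes the text comes as a list of
lines, each a list of EVA-spelled tokens; the letter classes, the
function names, and the crude paradigm test for step 4 are my own
guesses, and step 9's similarity table is cut down to a few examples
-- a sketch, not a definitive implementation (step 6 is sketched
separately below):

    import random
    from collections import Counter

    GALLOWS = set("ktpf")
    FINALS  = set("mng")
    BENCHES = ("ch", "sh")

    # Step 9: graphically confusable groups, mapped longest-first
    # (only a sample of the full table).
    SIMILAR = [("ith", "cth"), ("se", "sh"), ("ei", "a"),
               ("ak", "ok"), ("j", "d")]

    def char_counts(lines):
        """Character frequencies over the corpus, for the weirdo test."""
        c = Counter()
        for line in lines:
            for tok in line:
                c.update(tok)
        return c

    def keep_token(tok, pos, nwords, counts):
        """Steps 1-5: False means the token is discarded."""
        if any(counts[ch] < 5 for ch in tok):          # 1. weirdos
            return False
        if pos == 0 or pos == nwords - 1:              # 2-3. line edges
            return False
        if sum(ch in GALLOWS for ch in tok) >= 2:      # 4. crude paradigm test
            return False
        body = tok[1:-1]                               # 5. word-internal only
        if "y" in body or any(ch in FINALS for ch in body):
            return False
        return True

    def normalize(tok):
        """Steps 7-11: rewrite the token; None means discard it."""
        tok = tok.replace("q", "")                     # 7. erase <q>
        tok = tok.replace("p", "t").replace("f", "k")  # 8. decapitalize
        for rare, common in SIMILAR:                   # 9. pen-slip repairs
            tok = tok.replace(rare, common)
        out, i = [], 0                                 # 10. runs of <e>
        while i < len(tok):
            if tok[i] != "e":
                out.append(tok[i]); i += 1; continue
            j = i
            while j < len(tok) and tok[j] == "e":
                j += 1
            run = j - i
            if run == 1:
                out.append("e")
            elif run == 2:
                out.append("ch")
            elif run == 3:                             # two parsings: coin flip
                if random.random() < 0.5:
                    return None
                out.append(random.choice(["ech", "che"]))
            else:
                return None                            # <eeee>: discard
            i = j
        tok = "".join(out)
        tok = tok.replace("a", "").replace("o", "")    # 11. tone-mark suspects
        return tok or None

    def split_crust(tok):
        """Step 12: split a leading dealer off a gallows or bench."""
        if len(tok) > 1 and tok[0] in "dlrs" and \
           (tok[1] in GALLOWS or tok[1:].startswith(BENCHES)):
            return [tok[0], tok[1:]]
        return [tok]

    def filtered_lexicon(lines):
        counts = char_counts(lines)
        words = set()
        for line in lines:
            for pos, tok in enumerate(line):
                if keep_token(tok, pos, len(line), counts):
                    tok = normalize(tok)
                    if tok is not None:
                        words.update(split_crust(tok))
        return words

The number to set against the candidate language's syllable inventory
is then just len(filtered_lexicon(lines)).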
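
And for the "exercise to the reader" in step 6, one possible policy,
again only a guess: keep the fair coin for the discard decision, and,
when the token is kept, choose among the K readings with probability
proportional to each reading's frequency among the unambiguous tokens,
so that <daiin> beats <jeeeeb> nearly every time. The smoothing
constant and the fixed 50% discard rate for K > 2 are arbitrary:

    import random
    from collections import Counter

    def resolve_readings(readings, freq, discard_prob=0.5, smoothing=1):
        """Return one of K divergent readings, or None to discard.

        freq is a Counter of word counts over the unambiguously read
        tokens. With K = 2 this reduces to the coin flip of step 6;
        how discard_prob should really scale with K is left open,
        as in the original.
        """
        if random.random() < discard_prob:
            return None
        weights = [freq[r] + smoothing for r in readings]
        return random.choices(readings, weights=weights, k=1)[0]

    # E.g., with freq = Counter({"daiin": 863}):
    # resolve_readings(["daiin", "jeeeeb"], freq) almost always
    # yields "daiin" when the token is kept at all.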

If no one is willing to do this analysis, I plan to do it after
the school semester is over. (Due to past strikes, that means sometime
in February 2001...).

All the best,

--stolfi