
Re: Entropy & alphabets



Gabriel Landini wrote:
> 
> Hi all
> I did a small update on the effect of the alphabet on the entropy in
> the VMS.
> 
> http://web.bham.ac.uk/G.Landini/evmt/commas.htm
> 
> (near the end, a new graph in various alphabets)

	Very interesting.  The graphs form various families.
Dalgarno is a little lower than the
Curva/Currier/FSG/Gava family.  Also very interesting
is how English entropy crosses Latin entropy at about
100,000 characters and stays higher from then on.  Can
you think of a reason for this?  Dalgarno and English
both start to slope upward at ~3,000 characters.  

	Bear in mind that these curves come from different
character sets of widely differing sizes.  That's why
the h1-h2 number is useful; it indicates how good the
language/orthography is when groups of characters are
involved and is at least somewhat valid for comparisons
between character sets of differing sizes.  Could you
give us h1-h2 values for 20k or more characters?
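	For anyone who wants to play with these numbers, here
is a minimal sketch (my own, not Gabriel's script) of how h1
and h2 can be estimated from a transcription string; h2 is
approximated as the bigram entropy minus the single-character
entropy.

```python
from collections import Counter
from math import log2

def h1_h2(text):
    """Estimate first-order entropy h1 and conditional entropy h2
    (bits per character) from a transcription string.  h1 - h2 is
    the quantity discussed above: how much knowing one character
    of context reduces the uncertainty about the next."""
    singles = Counter(text)
    pairs = Counter(zip(text, text[1:]))
    n1 = sum(singles.values())
    n2 = sum(pairs.values())
    h1 = -sum(c / n1 * log2(c / n1) for c in singles.values())
    # H(pair) - H(single) approximates the conditional entropy
    # of a character given its predecessor.
    h_pair = -sum(c / n2 * log2(c / n2) for c in pairs.values())
    return h1, h_pair - h1

h1, h2 = h1_h2("ab" * 500)   # perfectly alternating toy text
# h1 is 1 bit; h2 is near 0, since context tells all
```

On a toy alternating text the single-character entropy is 1 bit
but the conditional entropy collapses to almost nothing, which is
exactly the h1-h2 gap the curves are measuring.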

	A comment on Rene's "Character Entropy to Word
Entropy".  He shows that the latter part of Voynichese
words carries much more information than the first
part.  Could this be due to the tripartite structure of
Voynichese words?  Say you have 20 beginning groups.
Then you might have 20 * 20 = 400 groups that
constitute the rest of the word.  Rene may well have
taken this into account and I simply didn't notice.  
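	To make the arithmetic of that argument concrete (the
counts are the hypothetical ones above, not measured figures):

```python
from math import log2

# Hypothetical counts from the argument above: 20 equally likely
# word beginnings, and 20 * 20 = 400 equally likely continuations.
beginnings = 20
continuations = beginnings * beginnings   # 400

bits_first = log2(beginnings)      # information in the word's start
bits_rest = log2(continuations)    # information in the remainder

# Under these (admittedly crude) uniform assumptions the remainder
# carries exactly twice as many bits as the beginning.
print(bits_first, bits_rest)
```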

	I've been talking to you and Jorge about the great
disparity between the number of words that fit the Firth
paradigm (~300), a paradigm which fits 75-80% of the
text, and the total number of distinct words in the text
(8200).  I thought I might be able to represent
Voynichese in a two-dimensional grid.  The tripartite
nature of Jorge's grammar of Voynichese words shows
that I'll need a three-dimensional grid.  This
complicates things enormously, unless there is
something to restrict the third dimension.  For
instance, perhaps the two-dimensional grid accounts for
the bulk of the tokens, and the third dimension is
needed only for uncommon words.  
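	As a toy illustration of the grid idea (the morph
inventories below are invented stand-ins, not Jorge's actual
classes), a three-dimensional grid is just a lookup keyed by a
(prefix, midfix, suffix) triple:

```python
# Toy three-dimensional grid: every cell is addressed by a
# (prefix, midfix, suffix) triple.  The morph inventories here
# are invented stand-ins, not Jorge's actual classes.
prefixes = ["qo", "o", "ch", "sh", "d"]
midfixes = ["k", "t", "e", ""]        # "" = no middle element
suffixes = ["y", "aiin", "ol", "ar"]

grid = {(p, m, s): p + m + s
        for p in prefixes for m in midfixes for s in suffixes}

# 5 * 4 * 4 = 80 cells; restricting the third axis (say, to a
# couple of suffixes for rare words) is what would keep the
# scheme manageable.
print(len(grid), grid[("qo", "k", "y")])
```

Even this tiny inventory yields 80 candidate words, which shows
how quickly the third axis inflates the grid.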

	In an attempt to gain insight into this, I took
Tiltman's and Firth's paradigms and classified the
internal parts of each according to Jorge's grammar of
Voynichese words.  The result is appended below.  

	In Firth's paradigm, a lot of prefixes are crust
only.  Some are crust-core, some are mantle or
mantle-core, but none are crust-mantle.  Interestingly,
none of the suffixes even contain core.  Quite a few
are crust only or crust-IN.  The remainder are
mantle-crust. None are mantle-IN.  

	Any more ideas, anyone?  

Dennis

---------------------------------------------------------------------------------

{crust}
{crust-dealer}
{mantle}
{core}
{circle}
{preceding mantle/core}
{IN}


{Fig. 27 -- Tiltman's Division of Common Words into
"Roots" and "Suffixes" }
{(Tiltman 1951)  (Currier's Transliteration) }

{Roots                      Suffixes}

o{circle}k{core}            a{circle}n{crust-dealer}
                            a{circle}in{IN}
                            a{circle}iin{IN}
                            a{circle}iiin{IN}
o{circle}f{core}
o{circle}t{core}            a{circle}r{crust-dealer} 
                            a{circle}ir{IN}
                            a{circle}iir{IN}
                            a{circle}iiir{IN}
o{circle}p{core}   
q{crust}o{circle}k{core}    a{circle}l{crust-dealer}
                            a{circle}il{IN}
                            a{circle}iil{IN}
                            a{circle}iiil{IN}
q{crust}o{circle}f{core}
q{crust}o{circle}t{core}    o{circle}r{crust-dealer}
q{crust}o{circle}p{core}
ch{mantle}                  o{circle}l{crust-dealer}
sh{mantle}                  e{preceding mantle/core}y{circle}
                            ee{mantle}y{circle}
                            ee{mantle}e{preceding mantle/core}y{circle}
d{crust-dealer}             e{preceding mantle/core}d{crust-dealer}y{circle}
                            ee{mantle}d{crust-dealer}y{circle}
                            ee{mantle}e{preceding mantle/core}d{crust-dealer}y{circle}
s{crust-dealer}



{Firth's paradigm, from his Work Note #24}

{Eventually, I decided to set the cutoff at four occurrences: any}
{group that occurs 4 or more times is probably genuine.  This}
{removes about 20% of the text, but it removes over 85% of the}
{unique groups, and most of the remainder look plausible.}

{What is Encoded?}

{So, we have some 280 groups in the Voynich A, that occur 4 or}
{more times, with the record being 355 for '8AM'.  If we assume}
{(pace Brumbaugh) that every group has a single decode, then that}
{sets an upper bound at 280 for the number of different plaintext}
{units.  So they're not words.}


{Odd Letters                Even Letters}

s{crust}                    d{crust-dealer}y{circle}
q{crust}o{circle}           d{crust-dealer}a{circle}l{crust-dealer}
q{crust}o{circle}k{core}    d{crust-dealer}a{circle}iin{IN}
q{crust}o{circle}t{core}    a{circle}l{crust-dealer}
d{crust-dealer}             a{circle}m{crust}
y{circle}k{core}            a{circle}iin{IN}
y{circle}t{core}            a{circle}in{IN}
k{core}                     a{circle}r{crust-dealer}
o{circle}                   e{preceding mantle/core}y{circle}
o{circle}k{core}            ee{mantle}y{circle}
o{circle}t{core}            e{preceding mantle/core}o{circle}l{crust-dealer}
t{core}                     o{circle}l{crust-dealer}
cth{core}                   o{circle}iin{IN}
ch{mantle}                  o{circle}r{crust-dealer}
chk{core}                   ch{mantle}y{circle}
ch{mantle}t{core}           ch{mantle}e{preceding mantle/core}y{circle}
ch{mantle}cth{core}         ch{mantle}o{circle}
ch{mantle}cph{core}         ch{mantle}o{circle}l{crust-dealer}
ch{mantle}ckh{core}         ch{mantle}o{circle}r{crust-dealer}
cph{core}                   sh{mantle}y{circle}
ckh{core}                   y{circle} (maybe)
sh{mantle}
sh{mantle}o{circle}


{With the exception of that silly letter '9' almost any}
{combination of symbols is locally decodable.  (Something's wrong}
{with 8 or AM or 8AM; otherwise, it's rigorous.)}

{[Note: and also with S/OM and S/OR.  But - and as a former}
{compiler writer I should really have spotted this - the lexical}
{scansion is unambiguous if you also keep track of odd and even.}
{'8' in state "odd" must be a letter; '8' in state "even" must be}
{the start of '89' or '8AE' or '8AM'.]}
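	Firth's odd/even observation is essentially a two-state
lexer.  Here is a minimal sketch (my own reconstruction; the
token list is the one from his note, but the scanner itself is
illustrative, not his actual code):

```python
# Two-state scanner sketch of the odd/even idea: '8' read in the
# "even" state must begin one of the multi-letter tokens below,
# while '8' in the "odd" state is an ordinary single letter.
EVEN_8_TOKENS = ("8AE", "8AM", "89")

def scan(word):
    tokens, i, odd = [], 0, True
    while i < len(word):
        if not odd and word[i] == "8":
            for tok in EVEN_8_TOKENS:
                if word.startswith(tok, i):
                    tokens.append(tok)
                    i += len(tok)
                    break
            else:
                raise ValueError("unscannable '8' in even state")
        else:
            tokens.append(word[i])   # any symbol is a one-letter token
            i += 1
        odd = not odd                # each token flips the state
    return tokens

print(scan("o8AM"))   # ['o', '8AM']: the '8' arrived in even state
print(scan("8o"))     # ['8', 'o']: here '8' was odd, a plain letter
```

The point of the sketch is that the same symbol '8' tokenizes two
different ways depending only on the scanner's state, so the
scansion stays unambiguous.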