
Re: Some thoughts about process



Bruce Grant wrote:
> 
> These are a couple of thoughts, admittedly not completely thought
> through, about the _process_ of trying to crack the VMS:
> 
> 1.   Entropy is a popular measure to calculate and speculate with, but
> it depends sensitively on the definition
>         of what the alphabet is. Would it be possible to develop some
> type of measure which would be
>         independent of the alphabet? I am thinking of something like the
> statistics used for data without
>         numerical values (e.g. rankings rather than measurements). Even
> if such a measure would not allow
>         you to say "this text has the same information content as Latin
> with every letter replaced by a pair"
>         or something like that, it might allow you to say "the text
> becomes more repetitive in the middle than
>         at the beginning" or so on.

	Look at Stolfi's "Where are the bits?":
http://www.dcc.unicamp.br/~stolfi/voynich/98-07-09-local-entropy/

Of course, he used entropy to do this, but it does
allow you to compare different texts and different
characters in context.

	I've struggled a lot with how entropy is dependent on
the alphabet chosen.  There's no way around the fact
that you can *potentially* include more information in
a larger character set.  You can also do a worse or
better job of using any character set.  

	You can compute the maximum entropy of any order for
any size character set, of course.  If you want to
exclude the effect of rare characters (the ampersand &
in typical English text, the picnic table in
Voynichese), use the number of characters that comprise
say 99% of the text.  Then compare the entropy
calculated for a given text to the maximum entropy of
the 99% character set.  
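
	A minimal sketch of that comparison, in Python.  It
assumes first-order (single-character) entropy and takes
the maximum entropy of an N-character set to be log2(N);
the function names and the 0.99 default are only
illustrative, not taken from anyone's actual scripts.

```python
import math
from collections import Counter

def first_order_entropy(text):
    """Single-character entropy of `text`, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def core_alphabet_size(text, coverage=0.99):
    """Number of most-frequent characters needed to cover `coverage` of the text."""
    counts = Counter(text)
    n = len(text)
    covered = 0
    for size, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= coverage * n:
            return size
    return len(counts)

def compare_to_maximum(text, coverage=0.99):
    """Return (h1, h_max, h1 / h_max), where h_max = log2(core alphabet size)."""
    h = first_order_entropy(text)
    h_max = math.log2(core_alphabet_size(text, coverage))
    ratio = h / h_max if h_max > 0 else 0.0
    return h, h_max, ratio
```

Here the entropy is still computed over the whole text,
so the rare characters contribute a little to h1 and the
ratio can come out slightly above 1.  Dropping the rare
characters before computing h1 would be an equally
reasonable reading.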

	But how do different languages compare on
communication efficiency?  Japanese is supposed to be a
vague and ambiguous language in speech; it has a
phonemic inventory of about 16 consonants and 10 vowels
(with short/long vowel distinction).  However, Jacques
told me that Tahitian is no more ambiguous than English
or French, and it probably has a phonemic inventory,
like Hawai'ian, of 10 vowels (again including
long/short) and 8 consonants!!!  There the entropy
numbers help you not at all.

> 2.    A lot of interesting ideas, such as the current discussion of
> gallows letters, are floated, and calculations
>         are produced, etc. then the ideas disappear into the archives.
> Is there some way to gather
>        such quantitative questions or theories and the resulting
> statistics about the VMS into one place
>       (say, a FAQ) where  you could look at it all at once?

	It would certainly be nice to have something like
that, but it would be a lot of work for one person at
least -- and then it would reflect only the ideas and
prejudices of one person!  

	We could have a list of links to the major papers that
list members have written on topics such as entropy,
Voynichese word paradigms, Zipf's law, the strokes of
the Voynich characters, etc.  There are fewer of these
papers, so prejudice would be a smaller factor. Such a
collection might be a good idea; it ought to be like
having a book called "The VMs since D'Imperio".  

	However, my impression is that those papers are
relatively long on statistics and relatively short on
hypotheses.  Does anyone else have the same
impression?  

> 3.    In order to do item #2, would it be worthwhile to try to produce
> from the EVA transcription file a
>         single machine readable transcription in a convenient form for
> computerized processing, even if
>         it were necessary to:
> 
>                 make some kinds of assumptions
>                 omit parts that cannot be reconciled between the
> different versions
>                 convert to a/the standard alphabet
>                 standardize the line-numbering scheme
>                 etc.?
> 
> 4.    If this were done, what would be a useful format?  XML?
> Relational tables? Simple lists of lines or words? Something else?

	We have Takeshi's complete transcription and Stolfi's
majority version, so we already have useful versions in
EVA text.  Beyond that I really don't know.  Little
work has been done on the syntax of Voynichese, so it's
hard to say what database format would be good.    

	BTW.  For my work on Hamptonese I've prepared a short
corpus of phonetic English.  By distinguishing upper-
and lower-case letters, I have one phoneme per
character.  I'll put it on my Hamptonese page in a
while.  Phonetic English has a somewhat high entropy
drop (h1-h2):

Language + writing          h1-h2
-------------------------   -----------
Latin                       ~0.7
English                      0.83-0.94
Phonetic English             0.95
Hawai'ian (full phonemic)    0.92
Japanese (romaji)           ~1.1
Hamptonese                   1.2
Voynichese (EVA)             1.8

	Hamptonese has ~10 vowels and ~21 consonants, so I'm
assuming it's phonetic English.  It still looks quite
weird; it may teach us something about the VMs.
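
	For anyone who wants to compute the entropy drop on
their own text, here is a rough Python sketch.  It
assumes h1 is the single-character entropy and h2 is the
conditional entropy of a character given the one before
it, estimated as H(pairs) - H(singles).  That definition
and the function names are assumptions for illustration;
numbers from it will not necessarily match the table
above, which depends on the corpus and transcription used.

```python
import math
from collections import Counter

def block_entropy(text, k):
    """Entropy, in bits, of the overlapping k-character blocks of `text`."""
    grams = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def entropy_drop(text):
    """Return (h1, h2, h1 - h2) for a text of at least two characters."""
    h1 = block_entropy(text, 1)
    # Conditional entropy of a character given its predecessor.
    h2 = block_entropy(text, 2) - h1
    return h1, h2, h1 - h2
```

On short texts the estimate of h2 is biased low, because
many possible character pairs never get a chance to
appear, so the drop looks larger than it really is.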

Dennis