Re: Some thoughts about process

To: voynich@xxxxxxxx
Subject: Re: Some thoughts about process
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Mon, 18 Sep 2000 19:02:14 -0300 (EST)
Delivered-to: reeds@research.att.com
In-reply-to: <39C219ED.4861CB9C@mail.msen.com>
References: <39C219ED.4861CB9C@mail.msen.com>
Reply-to: stolfi@xxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx

    > [Bruce Grant:] Entropy is a popular measure to calculate and
    > speculate with, but it depends sensitively on [the alphabet].
    
Indeed, entropy is a property of the encoding, not of the language
or the contents.
    
    > Would it be possible to develop some type of measure which would
    > be independent of the alphabet?
    
The character-based n-th order entropy is invariant under simple
substitution, but is affected by almost anything else: polyalphabetic
ciphers such as Vigenère, multi-character and variable-length
substitutions, nulls, etc. Word-based statistics and character
correlation analysis (as done by Gabriel Landini and Mark Perakh) are
somewhat more robust, but not much.  In the limit, a very efficient
encoding (like "zip") will produce random-looking, uniform-probability
strings with no apparent structure.
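
To make the point about simple substitution concrete, here is a small
Python sketch. The function names, the key, and the sample string are
made up for illustration, and "n-th order entropy" is taken here to mean
the entropy of a character given the n preceding characters, which is
only one of several definitions in use.

```python
import math
import random
import zlib
from collections import Counter


def cond_entropy(text, order):
    """Entropy in bits/char of a character given the `order` preceding ones."""
    n = order + 1
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    total = sum(grams.values())
    return -sum(count / total * math.log2(count / contexts[gram[:-1]])
                for gram, count in grams.items())


def simple_substitution(text, seed=0):
    """Replace every symbol by another one, using a fixed random permutation."""
    alphabet = sorted(set(text))
    shuffled = alphabet[:]
    random.Random(seed).shuffle(shuffled)
    return text.translate(str.maketrans(dict(zip(alphabet, shuffled))))


def polyalphabetic(text, key=(3, 11, 7)):
    """Vigenere-like cipher: the shift depends on the position in the text."""
    alphabet = sorted(set(text))
    index = {ch: i for i, ch in enumerate(alphabet)}
    size = len(alphabet)
    return "".join(alphabet[(index[ch] + key[i % len(key)]) % size]
                   for i, ch in enumerate(text))


if __name__ == "__main__":
    # Illustrative sample only; any longer plain text will do.
    sample = "the quick brown fox jumps over the lazy dog and the cat " * 50
    zipped = zlib.compress(sample.encode("ascii")).decode("latin-1")
    variants = {
        "plain": sample,
        "simple substitution": simple_substitution(sample),
        "polyalphabetic": polyalphabetic(sample),
        "zlib output": zipped,
    }
    for name, text in variants.items():
        figures = [round(cond_entropy(text, k), 3) for k in (0, 1, 2)]
        print(name, figures)
```

The first two rows should agree, since a simple substitution only
renames the symbols; the other two rows generally will not match the
plain text.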

    > 3. In order to do item #2, would it be worthwhile to try to produce
    > from the EVA transcription file a single machine readable transcription
    > in a convenient form for computerized processing [...]
    
Takeshi Takahashi prepared a transcription which is essentially
complete. (I recall that at some point it was missing a couple of
lines or a few labels, but that is probably fixed by now.)  That
version has been incorporated in the EVA interlinear (code H).

I also have two merged versions:

  a "consensus" merge of all transcribers (code Y)
  a "majority vote" merge of all transcribers (code A)

These two versions have "*"s where there was no consensus or no
absolute majority, respectively.  You will find them in 

  http://www.dcc.unicamp.br/~stolfi/EXPORT/projects/voynich/
  
The files are
  
  00-06-07-word-grammar/Notes/045/only-m.evt.gz (or .zip)
  00-06-07-word-grammar/Notes/045/only-c.evt.gz (or .zip)
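
The two merging rules can be illustrated with a short Python sketch.
This is not the program that produced the files above; it assumes the
readings of a line are already aligned character by character and have
equal length, and the function names and sample readings are invented.

```python
from collections import Counter


def merge_consensus(readings):
    """Keep a character only where all transcribers agree; otherwise '*'."""
    return "".join(col[0] if len(set(col)) == 1 else "*"
                   for col in zip(*readings))


def merge_majority(readings):
    """Keep the reading backed by an absolute majority; otherwise '*'."""
    merged = []
    for col in zip(*readings):
        char, votes = Counter(col).most_common(1)[0]
        merged.append(char if 2 * votes > len(col) else "*")
    return "".join(merged)


if __name__ == "__main__":
    # Three made-up, pre-aligned readings of the same word.
    readings = ["daiin", "daiin", "dairn"]
    print(merge_consensus(readings))
    print(merge_majority(readings))
```

With only two transcribers the two rules coincide, since a single vote
out of two is never an absolute majority.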

These files are in essentially standard EVT format, with
page and line numbers, comments, fillers, etc.
If you can compile C code, you can install Rene's VTT program
and use it to extract the bare text, without these decorations.
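
If compiling VTT is not convenient, a rough stand-in can be sketched in
Python. It is not VTT and does far less. It also rests on assumptions
about the EVT layout that should be checked against the actual files:
that whole-line comments start with "#", that each text line starts
with a locator in angle brackets, that inline comments are enclosed in
braces, and that "!" and "%" are fillers. The sample lines in the test
are invented.

```python
import gzip
import re

# All three patterns below are assumptions about the EVT conventions.
LOCATOR = re.compile(r"^<[^>]*>\s*")        # leading <page.line;code> tag
INLINE_COMMENT = re.compile(r"\{[^}]*\}")   # {...} comments inside a line
FILLERS = str.maketrans("", "", "!%")       # alignment / filler characters


def bare_text(evt_lines):
    """Yield the text of each line with locators, comments, fillers removed."""
    for line in evt_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue                        # blank line or whole-line comment
        if not LOCATOR.match(line):
            continue                        # not a located text line
        text = LOCATOR.sub("", line)
        text = INLINE_COMMENT.sub("", text).translate(FILLERS).strip()
        if text:
            yield text


def read_evt(path):
    """Read a gzipped EVT file and return its bare text lines."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return list(bare_text(handle))
```

Word separators and end-of-line markers are left in place, so the
output can still be split into words afterwards.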

    > If this were done, what would be a useful format?  XML?
    > Relational tables? Simple lists of lines or words?
    
I generally start from the EVT file, and use VTT and/or little AWK
programs to extract word lists, concordances, etc.
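
As a sketch of what such little programs do (written in Python here
rather than AWK), the following builds a word list and a simple
concordance from bare text lines. It assumes that ".", ",", "-" and "="
act as word or line separators; the function names and the sample lines
are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed separators: word spaces, uncertain spaces, line and paragraph ends.
WORD_BREAK = re.compile(r"[.,\-=\s]+")


def words(line):
    """Split one bare text line into words."""
    return [w for w in WORD_BREAK.split(line) if w]


def word_counts(lines):
    """Count how often each word occurs over all lines."""
    return Counter(w for line in lines for w in words(line))


def concordance(lines, width=1):
    """Map each word to (line number, context) pairs, `width` words each side."""
    index = defaultdict(list)
    for number, line in enumerate(lines, 1):
        ws = words(line)
        for i, w in enumerate(ws):
            context = " ".join(ws[max(0, i - width):i + width + 1])
            index[w].append((number, context))
    return index


if __name__ == "__main__":
    lines = ["daiin.chol.shol-", "chol,daiin="]   # made-up sample lines
    for word, count in word_counts(lines).most_common():
        print(count, word)
    for number, context in concordance(lines)["daiin"]:
        print(number, context)
```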

All the best,

--stolfi

Follow-Ups:
- Re: Some thoughts about process
  - From: Rene Zandbergen

References:
- Some thoughts about process
  - From: Bruce Grant
