
Re: Some thoughts about process

    > [Bruce Grant:] Entropy is a popular measure to calculate and
    > speculate with, but it depends sensitively on [the alphabet].
Indeed, entropy is a property of the encoding, not of the language
or the contents.
    > Would it be possible to develop some type of measure which would
    > be independent of the alphabet?
The character-based n-th order entropy is independent of simple
substitution, but is affected by almost anything else --
polyalphabetic ciphers such as Vigenère, multi-character and
variable-length substitutions, nulls, etc. Word-based statistics and
character correlation analysis (as done by Gabriel Landini and Mark
Perakh) are somewhat more robust, but not by much.  In the limit,
a very efficient encoding (like "zip") will produce random-looking,
uniform-probability strings with no apparent structure.
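
To make the first point concrete, here is a minimal Python sketch
(my own illustration, not anyone's published code). It takes "n-th
order entropy" to mean the conditional entropy H(n-grams) -
H((n-1)-grams), which is an assumption about the definition. The
sample text, the Atbash-style mapping and the key "key" are arbitrary
choices for the demonstration.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# A simple (one-to-one) substitution: a<->z, b<->y, ...
ATBASH = dict(zip(ALPHABET, reversed(ALPHABET)))

def block_entropy(text, n):
    """Shannon entropy, in bits, of the n-character blocks of text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def nth_order_entropy(text, n):
    """Assumed definition: H(n-grams) - H((n-1)-grams); n = 1 is plain character entropy."""
    if n == 1:
        return block_entropy(text, 1)
    return block_entropy(text, n) - block_entropy(text, n - 1)

def substitute(text, mapping):
    """Apply a simple substitution; unmapped characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

def vigenere(text, key):
    """Vigenere-encipher lowercase letters; other characters pass through."""
    out, k = [], 0
    for ch in text:
        if "a" <= ch <= "z":
            shift = ord(key[k % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            k += 1
        else:
            out.append(ch)
    return "".join(out)

sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat " * 21

for n in (1, 2):
    print(n,
          nth_order_entropy(sample, n),
          nth_order_entropy(substitute(sample, ATBASH), n),
          nth_order_entropy(vigenere(sample, "key"), n))
```

The first two columns should agree, since a one-to-one relabelling
only renames the n-grams without changing their counts. The third
column should differ, because the Vigenère cipher maps the same
plaintext letter to several ciphertext letters.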

    > 3. In order to do item #2, would it be worthwhile to try to produce
    > from the EVA transcription file a single machine readable transcription
    > in a convenient form for computerized processing [...]
Takeshi Takahashi prepared a transcription which is essentially
complete. (I recall that at some point it was missing a couple of
lines or a few labels, but that is probably fixed by now.)  That
version has been incorporated in the EVA interlinear (code H).

I also have 
  a "consensus" merge of all transcribers (code Y)
  a "majority vote" merge of all transcribers (code A)
These two versions have "*"s where there was no consensus or no
absolute majority, respectively.  You will find them in 

The files are
  00-06-07-word-grammar/Notes/045/only-m.evt.gz (or .zip)
  00-06-07-word-grammar/Notes/045/only-c.evt.gz (or .zip)
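
The two merge rules mentioned above (consensus and absolute majority,
with "*" where the rule fails) can be illustrated roughly as follows.
This is only a sketch under strong assumptions: it supposes the
readings have already been aligned character by character to the same
length, and it does not try to reproduce how the actual files were
built. The function names and the sample readings are hypothetical.

```python
from collections import Counter

def merge_column(chars, rule):
    """Merge one aligned character position across all transcribers."""
    best, votes = Counter(chars).most_common(1)[0]
    if rule == "consensus":
        # keep the character only if every transcriber agrees
        return best if votes == len(chars) else "*"
    if rule == "majority":
        # keep the character only if it has an absolute majority
        return best if 2 * votes > len(chars) else "*"
    raise ValueError("unknown rule: " + rule)

def merge_readings(readings, rule):
    """Merge pre-aligned readings of one line (assumed equal length)."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    return "".join(merge_column(col, rule) for col in zip(*readings))

readings = ["daiin", "daiin", "dauin"]   # made-up example readings
print(merge_readings(readings, "consensus"))
print(merge_readings(readings, "majority"))
```

With only two readings that disagree at some position, neither rule
can decide, so both produce "*" there.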

These files are in essentially standard EVT format, with
page and line numbers, comments, fillers, etc.
If you can compile C code, you can install Rene's VTT program
and use it to extract the bare text, without these decorations.

    > If this were done, what would be a useful format?  XML?
    > Relational tables? Simple lists of lines or words?
I generally start from the EVT file, and use VTT and/or little AWK
programs to extract word lists, concordances, etc.
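
For those without AWK or VTT at hand, the same kind of extraction can
be sketched in Python. Every syntax detail below is an assumption
about the EVT conventions and should be checked against the actual
files: a "<...>" locator at the start of each text line, "#" for
comment lines, "{...}" for inline comments, "!" and "%" as fillers,
"." and "," as word separators, and "-" and "=" as line and paragraph
ends. The sample line is made up.

```python
import re
from collections import Counter

LOCATOR = re.compile(r"^<([^>]*)>\s*(.*)$")   # assumed "<locator> text" layout

def bare_words(line):
    """Return (locator, words) for a text line, or None for other lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None                            # blank or comment line
    m = LOCATOR.match(line)
    if m is None:
        return None
    locator, text = m.group(1), m.group(2)
    text = re.sub(r"\{[^}]*\}", "", text)      # drop inline comments
    text = re.sub(r"[!%]", "", text)           # drop fillers
    words = [w for w in re.split(r"[.,\-=\s]+", text) if w]
    return locator, words

def word_counts(lines):
    """Count word occurrences over all text lines."""
    counts = Counter()
    for line in lines:
        parsed = bare_words(line)
        if parsed is not None:
            counts.update(parsed[1])
    return counts

sample = [
    "# a comment line",
    "<f1r.1;H>  fachys.ykal!!.ar{plant}.ataiin-",
    "<f1r.2;H>  ykal,daiin.ar=",
]
print(word_counts(sample).most_common(3))
```

To run it on one of the .gz files, something like
gzip.open(path, "rt") from the standard library could supply the
lines.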

All the best,