Re: Some thoughts about process
> [Bruce Grant:] Entropy is a popular measure to calculate and
> speculate with, but it depends sensitively on [the alphabet].
Indeed, entropy is a property of the encoding, not of the language
or the contents.
> Would it be possible to develop some type of measure which would
> be independent of the alphabet?
The character-based n-th order entropy is independent of simple
substitution, but is affected by almost anything else --
polyalphabetic ciphers such as Vigenère, multi-character and
variable-length substitutions, nulls, etc. Word-based statistics and
character correlation analysis (as done by Gabriel Landini and Mark
Perakh) are somewhat more robust, but not much. In the limit,
a very efficient encoding (like "zip") will produce
random-looking, uniform-probability strings with no [...]
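
The invariance claim above can be checked with a small sketch. It assumes that "n-th order entropy" means the conditional entropy of a character given the previous n-1 characters, estimated as a difference of block entropies; that definition, the sample sentence, the Caesar-shift key and the Vigenère shifts are all illustrative choices, not taken from the original analysis.

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Shannon entropy (bits) of the distribution of n-character blocks."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text, n):
    """Assumed definition: H(n-blocks) - H((n-1)-blocks); plain H for n = 1."""
    if n == 1:
        return block_entropy(text, 1)
    return block_entropy(text, n) - block_entropy(text, n - 1)

def substitute(text, key):
    """Simple (monoalphabetic) substitution using a character-to-character map."""
    return "".join(key[c] for c in text)

def vigenere(text, alphabet, shifts):
    """Polyalphabetic cipher: the shift applied depends on the position."""
    index = {c: i for i, c in enumerate(alphabet)}
    size = len(alphabet)
    return "".join(
        alphabet[(index[c] + shifts[i % len(shifts)]) % size]
        for i, c in enumerate(text)
    )

# Illustrative input only; any text over the same alphabet would do.
sample = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
alphabet = "abcdefghijklmnopqrstuvwxyz "
key = dict(zip(alphabet, alphabet[7:] + alphabet[:7]))  # a shift is one simple substitution

sub = substitute(sample, key)
vig = vigenere(sample, alphabet, [3, 11, 20])

for n in (1, 2):
    print(n,
          round(conditional_entropy(sample, n), 3),
          round(conditional_entropy(sub, n), 3),
          round(conditional_entropy(vig, n), 3))
```

The first two columns should agree for every n, because a simple substitution only relabels the blocks and leaves their counts unchanged. The Vigenère column is free to differ, since the same plaintext character is mapped to different symbols depending on its position.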
> 3. In order to do item #2, would it be worthwhile to try to produce
> from the EVA transcription file a single machine readable transcription
> in a convenient form for computerized processing [...]
Takeshi Takahashi prepared a transcription which is essentially
complete. (I recall that at some point
it was missing a couple of lines or a few labels, but that
is probably fixed by now.) That version has been
incorporated in the EVA interlinear (code H).
I also have

  a "consensus" merge of all transcribers (code Y)
  a "majority vote" merge of all transcribers (code A)

These two versions have "*"s where there was no consensus or no
absolute majority, respectively. You will find them in
The files are
00-06-07-word-grammar/Notes/045/only-m.evt.gz (or .zip)
00-06-07-word-grammar/Notes/045/only-c.evt.gz (or .zip)
These files are in essentially standard EVT format, with
page and line numbers, comments, fillers, etc.
If you can compile C code, you can install Rene's VTT program
and use it to extract the bare text, without these decorations.
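
For readers wondering how the "consensus" and "majority vote" merges mentioned above differ, here is a minimal sketch. It assumes the readings of a line are already aligned to equal length and that voting is done character by character; the function name `merge_readings` and the sample strings are hypothetical, and the procedure actually used to build the Y and A versions may differ.

```python
from collections import Counter

def merge_readings(readings, rule="majority"):
    """Merge aligned readings of one line, one character column at a time.

    rule="consensus": keep a character only if every reading agrees.
    rule="majority":  keep a character only if more than half agree.
    Any other column becomes "*".
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    merged = []
    for column in zip(*readings):
        char, votes = Counter(column).most_common(1)[0]
        if rule == "consensus":
            agreed = votes == len(column)
        else:
            agreed = votes * 2 > len(column)
        merged.append(char if agreed else "*")
    return "".join(merged)

# Made-up readings, for illustration only.
readings = ["abcd", "abcd", "abed"]
print(merge_readings(readings, "consensus"))
print(merge_readings(readings, "majority"))
```

With three readings that differ in one position, the consensus merge marks that position with "*" while the majority merge keeps the character that two of the three agree on.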
> If this were done, what would be a useful format? XML?
> Relational tables? Simple lists of lines or words?
I generally start from the EVT file, and use VTT and/or little AWK
programs to extract word lists, concordances, etc.
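
As a rough stand-in for that kind of small extraction script, here is a Python sketch that pulls a word list out of an EVT-style file. The line layout it assumes (a leading `<...>` locator, `#` comment lines, `{...}` inline comments, `!` and `%` fillers, `.` and `,` as word separators, `-` and `=` as line and paragraph breaks) is an assumption about the format, not something spelled out in this message; check it against the actual files before relying on it. The names `words_from_evt_line` and `word_counts` are hypothetical, and this is not a replacement for VTT.

```python
import gzip
import re
from collections import Counter

LOCATOR = re.compile(r"^<[^>]*>\s*")  # assumed: leading <page.line;code> locator
COMMENT = re.compile(r"\{[^}]*\}")    # assumed: inline comments in braces
FILLERS = re.compile(r"[!%]")         # assumed: alignment fillers
BREAKS = re.compile(r"[-=]")          # assumed: line and paragraph break markers

def words_from_evt_line(line):
    """Return the bare words of one EVT-style line (empty list for comments)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return []
    line = LOCATOR.sub("", line)
    line = COMMENT.sub("", line)
    line = FILLERS.sub("", line)
    line = BREAKS.sub(".", line)
    # Both "." and "," are treated as word separators here (a design choice).
    return [w for w in re.split(r"[.,\s]+", line) if w]

def word_counts(path):
    """Count word frequencies in a gzip-compressed EVT-style file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            counts.update(words_from_evt_line(line))
    return counts

# Made-up line in the assumed layout, for illustration only.
print(words_from_evt_line("<f0r.1;X>   abc.de!f,ghi{note}-"))
```

Treating "," as a separator splits words at uncertain spaces; dropping it from the split pattern would join them instead, which changes the word statistics noticeably.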
All the best,