
Re: Benchmark transcription file

To: voynich@xxxxxxxx
Subject: Re: Benchmark transcription file
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Sun, 12 Aug 2001 13:06:12 -0300 (EST)
In-reply-to: <3B754876.D58A13FA@mail.msen.com>
References: <20010809222149.68887.qmail@web9107.mail.yahoo.com> <5.1.0.14.0.20010811090144.0269d9b0@mail.globalnet.co.uk> <3B754876.D58A13FA@mail.msen.com>
Reply-to: stolfi@xxxxxxxxxxxxx

    > [Bruce Grant:] It seems to me that it would be nice to have a
    > sort of "benchmark transcription" of the VMS extracted from the
    > EVA transcription which would have these characteristics:
    > 
    >    * there would be only one transcription (one of the existing
    >    ones or a blend)
    >
    >    * there would be a single transcription alphabet, whatever
    >    the "consensus" is today (EVA?)

There are two such transcriptions available today:
 
  (1) Takeshi Takahashi's full transcription, which is included in 
  the interlinear file with tag "H":
  http://www.dcc.unicamp.br/~stolfi/voynich/98-12-28-interln16e6/

  (2) A majority-vote synthesis of all transcriptions that are
  available in the interlinear file:
  http://www.dcc.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/Notes/045/only-m.evt

Version (2) was obtained by comparing all available transcriptions of 
each character position, and selecting the EVA reading which was
chosen by more than half of the transcribers --- or "*" if there 
was no such majority reading. 

Therefore, version (2) has more holes than (1), and its readings may
have a section-dependent bias (since different sections were covered by
different people). On the other hand, if "*"-words are excluded, what
remains of version (2) should be somewhat more trustworthy than any
single version.
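
The voting rule behind version (2) can be sketched roughly as follows.
This is only an illustration, not the script that produced the file:
the function names are made up, and it assumes that the versions of a
line have already been aligned to the same length and that every
transcriber covers every position.

```python
from collections import Counter

def majority_reading(readings):
    """Return the reading chosen by more than half of the transcribers, else '*'."""
    if not readings:
        return "*"
    symbol, count = Counter(readings).most_common(1)[0]
    # A strict majority is required: a 1-1 or 2-2 split yields "*".
    return symbol if 2 * count > len(readings) else "*"

def majority_line(versions):
    """Combine aligned transcriptions of one line, position by position."""
    if len({len(v) for v in versions}) != 1:
        raise ValueError("versions must be non-empty and aligned to the same length")
    return "".join(majority_reading(column) for column in zip(*versions))
```

For instance, majority_line(["daiin", "daiin", "dairn"]) keeps "daiin",
while two transcribers who disagree on a character leave a "*" there.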

    >    * the format would be completely uniform and without
    >    comments, to facilitate analysis by a program

Both versions are in a subset of the EVMT format, designed by Rene and
Gabriel specifically to make as much information as possible available
for computer processing. 

In version (1), line comments are identified by a "#" on column 1, and
embedded comments are enclosed in "{}", and therefore are easily
removed with any good text editor. Version (2) contains no comments.

The "[|]" convention for alternate readings is NOT used in either
file; instead, the two alternatives are given on separate lines, as
two different transcriptions. Both versions include page header lines,
like "<f2v> {$I=H $Q=A $P=D $L=A $H=1}", which can be recognized by
the presence of "$" or by the absence of "." between the "<>".
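
If you prefer a script to a text editor, the two rules above (comments
and page headers) might look like this in Python. It is a minimal
sketch: the function names are made up, and the text-line locator in
the usage note is only an assumed example of the "<...>" form. Note
that the header test must run before the comments are stripped, since
the "$" fields live inside "{}".

```python
import re

def is_page_header(line):
    """True for lines like '<f2v> {$I=H ...}': a '$' is present, or no '.' between the '<>'."""
    match = re.match(r"<([^>]*)>", line)
    if match is None:
        return False
    return "$" in line or "." not in match.group(1)

def strip_comments(line):
    """Return None for a '#' comment line; otherwise remove embedded '{...}' comments."""
    if line.startswith("#"):
        return None
    return re.sub(r"\{[^}]*\}", "", line).rstrip()
```

With these, is_page_header("<f2v> {$I=H $Q=A $P=D $L=A $H=1}") is true,
and a text line such as "<f2v.1;H> kooiin{blot}.cheo-" comes out of
strip_comments without the "{blot}".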

Both versions use

  "*" for unreadable/dubious characters,
  "%" for untranscribed text,
  "." for definite word space,
  "," for possible word space,
  "-" for end-of-line or major gap within line,
  "=" for end-of-paragraph or end-of-label,
  "!" as a synchronization filler.
  
Rene's "VTT" tool can be used to remove the comments and page headers,
and perform many other useful tasks such as deleting fillers, mapping
all separators to " ", adding or removing ligature capitalization,
etc. (I haven't tested it recently, though. If you find that it 
doesn't like my interlinear file, please let me know.)
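
For those who would rather not depend on VTT, the filler and separator
steps are easy to do by hand. The sketch below is NOT VTT and does not
claim to match its behavior; the function and its two options are made
up. It expects only the text part of a line, with the locator and
comments already removed. Whether "," counts as a word break, and
whether words containing "*" are dropped, are choices left to the
caller; "%" is passed through untouched.

```python
def to_words(text, comma_is_space=False, drop_dubious=False):
    """Split the text part of a line into words, using the separator codes above."""
    text = text.replace("!", "")          # delete synchronization fillers
    for separator in ".-=":
        text = text.replace(separator, " ")
    # Possible word spaces: either honor them or ignore them entirely.
    text = text.replace(",", " " if comma_is_space else "")
    words = text.split()
    if drop_dubious:
        words = [w for w in words if "*" not in w]
    return words
```

For example, to_words("fachys.ykal,ar-ataiin!!=") gives
["fachys", "ykalar", "ataiin"], and with comma_is_space=True the second
word splits into "ykal" and "ar".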

    >    * the location information (folio, locus, line etc.) would be
    >    represented in a consistent way throughout (e.g. no "line 0")

I suppose that the basic line numbering scheme, starting from 1 on
each page, was defined by Jim Reeds, Jacques, and/or Gabriel when they
prepared the first version of the interlinear file, from W. Friedman's
punched card transcription.

Unfortunately, Friedman's transcription skipped a few lines (or line
breaks) here and there. When those lines came up in newer
transcriptions, they had to be given numbers such as "0", "10a", etc.
in order to (a) still reflect their logical order in the text, and (b)
preserve the existing line numbers, which we had been using for years
in the list. (Also, some of Friedman's lines were bogus, so watch out
for gaps in the present line numbering.)

Granted, those quirks in the line numbering hamper certain kinds of
computer analysis; but we already have to cope with page numbers like
"f82v1", anyway. The right approach is to regard those strings as line
*names*, and use the physical order within the file whenever the
line *number* is needed.
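
In a program, that amounts to something like the sketch below. The
function name is made up, and it assumes that you have already
extracted (page, line name) pairs in file order. Repeated pairs, as
happen when several transcribers cover the same line, keep the position
of their first occurrence.

```python
def index_lines(names):
    """Map each (page, line name) pair to its 1-based physical position on that page."""
    next_position = {}   # page -> how many distinct lines seen so far
    ordinal = {}         # (page, line name) -> position within the page
    for page, name in names:
        if (page, name) in ordinal:
            continue     # another transcriber's version of a line already seen
        next_position[page] = next_position.get(page, 0) + 1
        ordinal[(page, name)] = next_position[page]
    return ordinal
```

So a page whose lines are named "0", "1", "10a" simply gets positions
1, 2, 3, and the names remain available as labels.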

    >    * the distinction of "text" and "labels" would be indicated
    >    explicitly.
    
In the interlinear page above, you should find an INDEX which
specifies the type (text, labels, "titles", isolated letters, etc.) of
each part of the interlinear file.

With that information, and a helpful editor, you should be able to
extract text-only and labels-only files.
    
    >    * after being created and the formatting verified as correct,
    >    the file would not change.
    > 
    > Even though such a file would contain less information than the
    > EVA, it would be useful for the various statistical tests that
    > people do, making these tests directly comparable and
    > repeatable, and allowing the group to accumulate a set of
    > consistent and useable statistics over time.
     
This goal is too naive. Since the text is handwritten, and the
alphabet is unknown, the correct reading of each character cannot be
determined with certainty at the present time. Moreover, transcription
is very hard work, and even the most careful person will make a few
errors every 1000 characters or so.

As better images become available, people will inevitably provide more
accurate transcriptions, and we would be foolish not to use such
cleaner texts in the analysis. (After all, our object of study is the
VMS, not some arbitrarily mangled transcription of it.)

So we must all learn to live with inconsistent statistics that result
from different input files. In fact, we must get used to the fact that
all statistics are inevitably noisy (because of scribal errors, if
nothing else) and incomplete (because of all those missing pages). A
statistical conclusion should not be trusted unless it can survive
random replacement of 1-2% of the text's characters.
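
One way to apply that test is sketched below. The names, the default
rate of 2%, and the number of trials are my own choices for
illustration; how much variation counts as "surviving" is left to the
user. Separators and other symbols outside the given alphabet are left
alone, and a replacement may by chance pick the same letter again, so
the effective error rate is slightly below the nominal one.

```python
import random

def perturb(text, alphabet, rate, rng):
    """Replace each character that belongs to `alphabet` with a random one, with probability `rate`."""
    out = []
    for ch in text:
        if ch in alphabet and rng.random() < rate:
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)

def statistic_range(statistic, text, alphabet, rate=0.02, trials=20, seed=0):
    """Return (value on the original text, min over noisy copies, max over noisy copies)."""
    rng = random.Random(seed)   # fixed seed so that the experiment is repeatable
    values = [statistic(perturb(text, alphabet, rate, rng)) for _ in range(trials)]
    return statistic(text), min(values), max(values)
```

If the value on the original text lies far outside the range seen on
the noisy copies, or the range itself is wide compared to the effect
being claimed, the conclusion deserves suspicion.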


All the best,

--stolfi

PS. Just to keep the names clear: "EVMT" is an ongoing transcription
project by Rene and Gabriel; "EVA" is just the transcription alphabet
developed for the EVMT (and adopted by several other people in the
list, including Takeshi and myself); and the "EVMT format" specifies
EVA for the text, plus specific conventions for comments, page and
line numbers, fillers and separators, etc.
