
Re: Benchmark transcription file



Jorge Stolfi wrote:


> There are two such transcriptions available today:

and

> Granted, those quirks in the line numbering hamper certain kinds of
> computer analysis; but we already have to cope with page numbers like
> "f82v1", anyway. The right approach is to regard those strings as line
> *names*, and use the physical order within the file whenever the
> line *number* is needed.

This is good ... the second file you mention is pretty much what I was looking for.

The line numbering is a little unfortunate, though: if the lines were really numbered
sequentially, you could tell whether two lines are contiguous just by comparing their
numbers. This would be useful if you were working with individual words instead of lines.

(In some programs I was playing with, I reformatted the lines into one line per word,
keeping track of the line number, position from start of line and position from end of
line as well as page and locus. This way, if a pair of words consisted of the last word of
a line and the first word of the next line you knew they were consecutive.)
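
A rough sketch of that one-record-per-word layout, in Python. The input shape
(tuples of page, locus name and line text), whitespace as the word separator,
and every name below are assumptions made for illustration; none of it is
taken from the transcription files themselves.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class WordRecord(NamedTuple):
    page: str        # page name, kept as an opaque string
    locus: str       # traditional line name, kept as an opaque string
    line_no: int     # physical order of the line within the input (0-based)
    from_start: int  # word position counted from the start of the line
    from_end: int    # word position counted from the end of the line
    word: str


def explode(lines: Iterable[Tuple[str, str, str]],
            sep: Optional[str] = None) -> List[WordRecord]:
    """Turn (page, locus, text) lines into one record per word.

    sep=None splits on whitespace (an assumption; pass the real
    word separator of the file being read).
    """
    records = []
    for line_no, (page, locus, text) in enumerate(lines):
        words = text.split(sep)
        last = len(words) - 1
        for i, word in enumerate(words):
            records.append(WordRecord(page, locus, line_no, i, last - i, word))
    return records


def consecutive_across_lines(a: WordRecord, b: WordRecord) -> bool:
    """True if a ends one line and b starts the physically next line."""
    return (a.from_end == 0
            and b.from_start == 0
            and b.line_no == a.line_no + 1)
```

Because line_no comes from physical order rather than from the locus name,
the check works even when the names are irregular. Note that this sketch
does not compare the page fields, so the last line of one page and the first
line of the next also count as adjacent; add that comparison if it matters.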

One possibility would be to renumber the lines in the machine-readable file and keep a
table to let you map back to the traditional numbers when you want to talk about them. In
fact, you could do the same thing for page numbers as well.
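
A minimal sketch of the renumbering idea, again in Python and again with
hypothetical names: the lines get sequential numbers in file order, and two
tables map between those numbers and the traditional names.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def renumber(names: Sequence[Hashable]) -> Tuple[List[Hashable], Dict[Hashable, int]]:
    """Assign sequential numbers (0-based) to names given in file order.

    Returns (seq_to_name, name_to_seq). Names must be unique, otherwise
    the mapping back from a name would be ambiguous.
    """
    seq_to_name = list(names)
    name_to_seq: Dict[Hashable, int] = {}
    for n, name in enumerate(seq_to_name):
        if name in name_to_seq:
            raise ValueError(f"duplicate name: {name!r}")
        name_to_seq[name] = n
    return seq_to_name, name_to_seq


def are_contiguous(a: int, b: int) -> bool:
    """Two sequential numbers are contiguous if they differ by exactly one."""
    return abs(a - b) == 1
```

For example, with made-up names:

```python
names = [("f1r", "1"), ("f1r", "2"), ("f1r", "2a"), ("f1v", "1")]
seq_to_name, name_to_seq = renumber(names)
are_contiguous(name_to_seq[("f1r", "2")], name_to_seq[("f1r", "2a")])  # True
seq_to_name[3]  # ("f1v", "1"), the traditional name to quote in discussion
```

The same function could be applied to a list of page names to renumber
pages in the same way.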

Bruce