VMs: Interlinear block codes, revised...?



  > [Nick Pelling:] I think a good (though perhaps somewhat boring)
  > project would be to produce a new version of the interlinear text with
  > rationalised block codes, so that programmes (like mine) can sensibly
  > filter in/out blocks of text - ie, to examine the stats for the star
  > labels, the pharma plant labels, titles, etc.

Nick, indeed the block IDs (the part of the locator that lies
between the page ID and the line number) were assigned rather
randomly. I know (better than most!) how much trouble that causes
programmers. However, please keep in mind that:
  
  (1) the file grew gradually over time, starting from the files by
  Friedman et al., who transcribed only the paragraph text;
  
  (2) many parts of the file were formatted before page images were
  conveniently accessible, so text that was originally entered as a
  single text unit (block) later had to be split into two or more
  units, or vice versa;
  
  (3) text unit boundaries and types are still uncertain in some
  cases. Once we decipher the manuscript, we will surely find that
  some of our distinctions (say, between circular and paragraph text)
  are quite pointless, while some visually homogeneous units should be
  split into separate parts (section title, ingredients, disease list,
  etc.)

When I converted Gabriel's interlinear to EVA, and added the new
material, I tried to maintain some compatibility with previous
versions and with the official EVT format (as accepted by Rene's VTT).
That is why there are line numbers like "0a" or "21b" (which are even
more troublesome to programmers than the random unit IDs).
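
To illustrate why such line numbers are awkward, here is a minimal
Python sketch that splits a locator into its parts and orders line
numbers like "0a" and "21b". It assumes locators have the shape
<page.block.line;transcriber>; that shape, the sample locator, and the
names parse_locator and line_sort_key are assumptions made for the
example, not part of any released tool.

```python
import re

# Assumed locator shape: <page.block.line;transcriber>, with the
# transcriber code optional. Each field is kept as an opaque string.
LOCATOR = re.compile(r"<([^.<>;]+)\.([^.<>;]+)\.([^.<>;]+)(?:;([^<>]+))?>")


def parse_locator(text):
    """Split a locator into page, block ID, line number and transcriber.

    Returns None when the text does not start with a locator.
    """
    m = LOCATOR.match(text.strip())
    if m is None:
        return None
    page, block, line, transcriber = m.groups()
    return {"page": page, "block": block, "line": line,
            "transcriber": transcriber}


def line_sort_key(line):
    """Sort key for line numbers such as '3', '0a' or '21b'.

    The numeric prefix is compared as an integer and any letter suffix
    as a string; values without a numeric prefix sort first.
    """
    m = re.match(r"(\d+)(.*)", line)
    if m is None:
        return (-1, line)
    return (int(m.group(1)), m.group(2))
```

Keeping the line number as a string and sorting with a separate key
avoids any renumbering, so the locators in the file stay untouched.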

Because of the above uncertainties, I don't think that we are ready to
assign *the* definitive block and line numbers. If we do a global
renumbering of blocks and lines now, we will surely have to change
them again later. On the other hand, merging two files with very
different line numbers is a lot of hard work. Therefore, I believe
that, at this point, compatibility is more important than consistency.
I vote for preserving the current locators as much as possible.

The INDEX file that comes with version 1.6e6 of the interlinear
contains one line for each text block. Two fields that you may find
helpful are (1) a "block sequence number", in presumed reading order,
and (2) a "block type code", a single word that describes the kind of
text in the block. My classification seems to roughly match your
proposed categories.

My approach when processing VMS text has been to assume that block IDs
are random strings, which are temporarily mapped to numbers and/or
categories, when needed, through the INDEX table.  In my "Notebooks"
directories you will find scripts and tables for that purpose.
If you use Linux/Unix, you may find the following awk script
useful: 

  http://www.ic.unicamp.br/~stolfi/PUB/bin/map-field
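
For those who prefer Python, the same idea (treat block IDs as opaque
keys and look up a sequence number and type code in a table) can be
sketched as follows. This is not the map-field script. The table
layout used here (one "page.block:sequence:type" entry per line), the
type codes in the usage comment, and the names load_index and
filter_by_type are all assumptions for the example; the real INDEX
file would first have to be reduced to those three columns.

```python
def load_index(lines, sep=":"):
    """Build a lookup table from lines of 'page.block:sequence:type'.

    Blank lines and lines starting with '#' are skipped. The column
    layout is an assumption of this sketch, not the real INDEX format.
    """
    table = {}
    for raw in lines:
        raw = raw.strip()
        if not raw or raw.startswith("#"):
            continue
        key, seq, kind = raw.split(sep)[:3]
        table[key] = (int(seq), kind)
    return table


def filter_by_type(records, table, wanted):
    """Yield (sequence, page, block, text) for blocks of a wanted type.

    'records' is an iterable of (page, block, text) tuples. Blocks that
    are missing from the table are silently dropped.
    """
    for page, block, text in records:
        entry = table.get(f"{page}.{block}")
        if entry is not None and entry[1] in wanted:
            yield entry[0], page, block, text


# Usage with made-up entries and made-up type codes:
#   table = load_index(["f1r.P1:1:parags", "f68r1.S1:2:labels"])
#   stars = list(filter_by_type(records, table, {"labels"}))
```

Because the block IDs are never interpreted, only looked up, this
keeps working if the table is later revised or the blocks reclassified.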

  > Also: I'm assuming that TEXT16E6.EVT is the latest version of the
  > interlinear to work from, but does anyone have any updates or
  > corrections they'd like to make to it?

I have many minor corrections, which I have been saving for the
long-overdue release 1.6e7. Unfortunately, I do not know when I will
have time to do it; certainly not before Dec/2003. (For one thing, I have
18 months of mostly-unread VMS mail to go through. Are there any new
transcriptions out there? Glen, Gabriel, Rene...?) In case you can't
wait, here is a gzipped tarfile of the contents of my working
directory, as of my last fix.

  http://www.ic.unicamp.br/~stolfi/voynich/2002-09-15-pre16e7.tgz

All the best,

--stolfi