RE: More Chinese stuff: Shennong Bencao
Could the VMS recipes section be a phonetic transcription of the
Shennong Bencao Jing, the classical Chinese pharmacopoeia?
First, here are some comparative statistics between the Shennong
Bencao text that I posted earlier (without punctuation) and the VMS
recipes section (majority vote version of the EVA interlinear file
v1.6e6).
In the Bencao, a "token" is defined as an occurrence of one Chinese
character, i.e. a syllable. In the VMS, a word space was inserted at
every point where at least one transcriber saw a word space. In both
cases a "word" is a distinct token, so repeated occurrences of the
same token count only once.
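As a minimal sketch of these two definitions (the function name and the toy sample are hypothetical, not taken from either file), tokens and distinct words can be counted like this in Python. For the Bencao one would pass list(text), so that every character is one token; for the VMS one would split the transcription on word spaces.

```python
from collections import Counter

def token_stats(tokens):
    """Return (number of tokens, number of distinct words)."""
    counts = Counter(tokens)
    return len(tokens), len(counts)

# Toy EVA-like sample, invented for illustration only.
sample = "qokeedy qokeey ar qokeedy daiin ar".split()
n_tokens, n_words = token_stats(sample)
print(n_tokens, n_words)  # 6 tokens, 4 distinct words
```

Tokens containing "*" would have to be filtered out before counting distinct words, as was done for the VMS figure quoted below.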
Number of recipes:
The Bencao is said to contain 365 recipes, although the text that I
have has only 357. (This appears to be a property of the source
manuscript, not a transcription bug.)
The "recipes" section of the VMS has 323 "stars" in 23 pages
(f103r-f108v and f111r-f116r), not counting the large block
of text at the end of f116r. Four pages are clearly missing;
if they contained 14 recipes each (the average for surviving pages)
the total number should have been around 379. If one of those
pages contained other material, then the estimate would drop to 365.
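The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch of the reasoning above; the variable names are mine, and the numbers are the ones quoted in the text.

```python
stars = 323    # stars counted on the surviving pages
pages = 23     # surviving pages of the recipes section
missing = 4    # pages assumed to be missing

per_page = round(stars / pages)                 # average recipes per page
all_recipes = stars + missing * per_page        # all 4 missing pages held recipes
one_other = stars + (missing - 1) * per_page    # one missing page held other material

print(per_page, all_recipes, one_other)  # 14 379 365
```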
Number of tokens:
The Bencao contains 12826 tokens, or
35.93 tokens/recipe (deviation: ±9.22).
The extant VMS recipes contain 10540 tokens, or
32.63 tokens/recipe (deviation: ±11.15).
Here are the histograms of recipe lengths, in tokens, for both files:
http://www.ic.unicamp.br/~stolfi/voynich/misc/recipe-tk-hist.png
Number of distinct words:
The Bencao uses 1113 distinct words (syllables).
The VMS recipes use 2760 distinct words, not counting 640 tokens
with unreadable/contentious characters ("*") in them.
Word repeats:
The Bencao contains 41 word repeats
(occurrences of ... X X ... consecutive pairs where X is any word).
The most common words appearing in the Bencao repeats are
8 pairs: big5 "¬~" = jis "?ô"
6 pairs: big5 "¦å" = jis "??"
5 pairs: big5 "´H" = jis "?¦"
The VMS recipes contain 77 word repeats; extrapolating from 23 to
27 pages, that would be 90 repeats. The most common words in
those repeats are
10 pairs: eva "qokeedy"
10 pairs: eva "qokeey"
7 pairs: eva "ar"
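For reference, here is a minimal Python sketch of how such repeats can be counted. The function name and the toy sample are hypothetical. Note two assumptions that the counts above do not spell out: a run X X X is counted here as two pairs, and pairs are counted across line and recipe boundaries.

```python
from collections import Counter

def word_repeats(tokens):
    """Count consecutive pairs X X and tally which words repeat."""
    pairs = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    return sum(pairs.values()), pairs

# Toy EVA-like sample, invented for illustration only.
sample = "qokeedy qokeedy ar ar ar daiin qokeey".split()
total, by_word = word_repeats(sample)
print(total, by_word.most_common())

# Extrapolation used above: 77 repeats on 23 pages, scaled to 27 pages.
print(round(77 * 27 / 23))  # 90
```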
What do we make of these numbers? Before drawing any conclusions,
we should note that:
* The recipe boundaries are not always well-marked in the VMS.
The four clues we could use are
(1) the last line is shorter,
(2) the first line starts or contains p/f gallows,
(3) the last line ends with rarities such as "m", "g", "dl", etc.
(4) there is a star at the left of the first line.
On favorable pages, we find good correlation between these
clues, especially (1), (2), and (4), except that the star
is often seen to be displaced vertically by as much as one line
from its correct position. On some pages, on the other
hand, there are long blocks of text without clue (1),
and the star positions often seem to be inconsistent with clues
(2) and (3). Sometimes even the number of stars seems to
be wrong. For instance:
on page f103v, star #8 seems to be too high (it implies a 1-line
recipe followed by a 5.5-line one).
on page f108v, stars #4, #7, #8, #10-#16 seem suspicious.
on page f111r, there may be one star too many among #2--#11.
on page f111v, there may be one star too many among stars #2--#10.
Also, the scribe seems to have lost the sync in the bottom half.
I would say that
#13 is 1 line too high,
#14 is 2 lines too high,
#15 is one paragraph too high,
#16, #17 are 2-3 lines too high, and
#18 is 1/2 line too high.
on page f113r, the scribe either missed a star after #8, or
shifted stars #9-#15 down by one paragraph, and #16 down by 4.5 lines
on page f115r the scribe may have omitted a star
after #5 and/or after #9.
The Bencao recipe length histogram is fairly compact, with most
recipes containing between 25 and 45 tokens. Mistakes in recipe
boundaries would make the histogram wider, by creating an excess
of shorter and longer recipes -- just as we see in the VMS
histogram.
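The widening effect is easy to reproduce with a toy simulation. The Python sketch below uses a hypothetical error model of my own (each boundary is misplaced with some probability, by a random number of tokens) and an idealized input where every recipe has exactly 35 tokens; neither is derived from the VMS.

```python
import random
import statistics

def perturb_boundaries(lengths, error_rate, max_shift, seed=0):
    """Misplace some boundaries: move tokens from one recipe to the next.

    Hypothetical error model: each boundary is wrong with probability
    error_rate, and a wrong boundary sits 1 to max_shift tokens too early.
    """
    rng = random.Random(seed)
    out = list(lengths)
    for i in range(len(out) - 1):
        if rng.random() < error_rate:
            # Never empty a recipe completely.
            shift = min(rng.randint(1, max_shift), out[i] - 1)
            out[i] -= shift
            out[i + 1] += shift
    return out

# Idealized input: 100 recipes of exactly 35 tokens (an assumption).
true_lengths = [35] * 100
seen_lengths = perturb_boundaries(true_lengths, error_rate=0.3, max_shift=20)

# The total token count is unchanged, so the mean stays the same,
# but the spread of recipe lengths grows.
print(sum(true_lengths) == sum(seen_lengths))
print(statistics.pstdev(true_lengths), statistics.pstdev(seen_lengths))
print(min(seen_lengths), max(seen_lengths))
```

Each misplaced boundary produces one recipe that is too short next to one that is too long, which is the kind of excess at both ends described above.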
Considering all these problems, the match between the two recipe
length histograms actually seems surprisingly good. Note in
particular that both histograms have roughly the same range, a
sharp increase at 25 tokens, a linear decrease from 45 to 60
tokens, and a small bump at >70 tokens. The latter is due to the
following recipes:
Voynich:
74 tokens: recipe 064 page 105r lines 16-22
71 tokens: recipe 079 page 105v lines 32-38
72 tokens: recipe 247 page 113r lines 45-51
Bencao:
72 tokens: recipe 099 page 01-13 lines b09-b10 and a01-a02
92 tokens: recipe 109 page 01-15 lines a05-a09
72 tokens: recipe 171 page 02-07 lines a02-a05
However, note that these extra-long recipes may be accidental
joins of two normal-size ones (30-40 tokens), in which case it is
pointless to try to match them.
* The mapping between Chinese characters and VMS words may
not be one-to-one.
There are many factors that could cause discrepancies here. First,
there is much uncertainty in the VMS transcription, especially on
the pairs a/o, r/s, ch/ee, etc. There is even more uncertainty
about word spaces: some tokens may have been incorrectly joined or
split.
Furthermore, the same Chinese character may be pronounced in
different ways depending on the context: in particular, certain
tones change into others, in a regular fashion, depending on the
tone of the preceding syllable. The Chinese text does not register
these changes, but a phonetic transcription would.
Moreover, the same Chinese sound may have been written in several ways
by the author. This usually happens when one is taking dictation
in an unfamiliar or unknown language. In addition, those frequent m/g
in line-final position suggest that the Voynichese text contains
many abbreviations. Finally, the use of p/f "capitals" in the Voynich
script is expected to inflate the number of different words.
All these factors would increase the apparent size of the VMS
vocabulary. Note that if about 15% of the Voynichese tokens (roughly
1600 of 10540) were affected by those factors, each yielding a
spurious new word, that would suffice to explain the discrepancy
observed in the vocabulary sizes (1113 distinct words in the Bencao,
2760 in the VMS). Therefore, this discrepancy, although substantial,
still doesn't quite prove that they are different texts.
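Here is a back-of-the-envelope check of that 15% figure in Python. It rests on a strong simplifying assumption of mine: every affected token turns into a new distinct word that occurs nowhere else, which is the most favorable case. The variable names are hypothetical; the counts are the ones quoted above.

```python
vms_tokens = 10540     # tokens in the extant VMS recipes
vms_words = 2760       # distinct words in the VMS recipes
bencao_words = 1113    # distinct words (syllables) in the Bencao

gap = vms_words - bencao_words       # extra distinct words to be explained
affected = 0.15 * vms_tokens         # tokens affected if the rate is 15%
needed_rate = gap / vms_tokens       # rate that would close the gap exactly

print(gap)                           # 1647
print(round(affected))               # 1581
print(round(100 * needed_rate, 1))   # 15.6
```

So under that assumption a rate of about 15% to 16% covers essentially the whole gap; if affected tokens often coincide with each other, a higher rate would be needed.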
For the same reasons, even if the VMS turns out to be a transcription
of the Bencao, we should not expect to have a simple word-for-word
match between the two texts.
In conclusion, I would say that the jury is still out on this one...
All the best,
--stolfi