
RE: More Chinese stuff: Shennong Bencao

To: voynich@xxxxxxxx
Subject: RE: More Chinese stuff: Shennong Bencao
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Fri, 22 Feb 2002 23:05:13 -0300 (EST)
Cc: song@xxxxxxxxxx, rt@xxxxxxxxxx
In-reply-to: <200202221642.g1MGgLm17717@neper.dcc.unicamp.br>
References: <3143B46F3796D51190390002A5518A182333C6@acnt45.ac1.dsh.de> <200202221642.g1MGgLm17717@neper.dcc.unicamp.br>
Reply-to: stolfi@xxxxxxxxxxxxx

Could the VMS recipes section be a phonetic transcription of the
Shennong Bencao Jing, the classical Chinese pharmacopoeia?  

First, here are some comparative statistics between the Shennong
Bencao text that I posted earlier (without punctuation) and the VMS
recipes section (majority vote version of the EVA interlinear file
v1.6e6).

In the Bencao, a "token" is defined as an occurrence of one Chinese
character, i.e. a syllable. In the VMS, a word space was inserted at
every point where at least one transcriber saw a word space. In both
cases a "word" is a distinct token, i.e. repeated occurrences of the
same token are counted only once.
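
A minimal sketch of the token and word counts defined above, assuming
the text has already been split into a list of tokens (one Chinese
character per token for the Bencao, one space-delimited EVA string per
token for the VMS). The function name and the sample tokens are
illustrative only; they are not the script actually used for the
figures below. Tokens containing "*" are left out of the distinct-word
count, as is done for the VMS figures.

```python
def count_tokens_and_words(tokens, unreadable="*"):
    """Return (number of tokens, number of distinct readable words)."""
    # Every occurrence counts as a token, readable or not.
    n_tokens = len(tokens)
    # A "word" is a distinct token; tokens with unreadable characters are skipped.
    words = {t for t in tokens if unreadable not in t}
    return n_tokens, len(words)


# Made-up EVA sample: 6 tokens, one of them with an unreadable character.
sample = "qokeedy qokeedy ar ch*dy ar daiin".split()
n_tokens, n_words = count_tokens_and_words(sample)
print(n_tokens, n_words)
```

For an unpunctuated Chinese text the token list would simply be the
list of its characters, e.g. list(text) after removing whitespace.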

  Number of recipes:

    The Bencao is said to contain 365 recipes, although the text that I
    have has only 357. (This appears to be a property of the source
    manuscript, not a transcription bug.)

    The "recipes" section of the VMS has 323 "stars" in 23 pages
    (f103r-f108v and f111r-f116r), not counting the large block
    of text at the end of f116r.  Four pages are clearly missing;
    if they contained 14 recipes each (the average for surviving pages)
    the total number should have been around 379.  If one of those
    pages contained other material, then the estimate would drop to 365.
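
    The arithmetic behind these two estimates can be checked with a
    short sketch. The variable names are mine; the only inputs are the
    figures quoted above, and the per-page average is rounded to a whole
    number of recipes as in the text.

```python
stars = 323            # stars counted on the surviving pages
surviving_pages = 23   # f103r-f108v and f111r-f116r
missing_pages = 4

# Average recipes per surviving page, rounded to a whole recipe (14).
per_page = round(stars / surviving_pages)

# All four missing pages held recipes.
estimate_all = stars + missing_pages * per_page

# One of the missing pages held other material.
estimate_three = stars + (missing_pages - 1) * per_page

print(per_page, estimate_all, estimate_three)
```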
    
  Number of tokens:

    The Bencao contains 12826 tokens, or 
    35.93 tokens/recipe (deviation: ±9.22).
    
    The extant VMS recipes contain 10540 tokens, or 
    32.63 tokens/recipe (deviation: ±11.15).
    
    Here are the histograms of recipe lengths, in tokens, for both files:
    
      http://www.ic.unicamp.br/~stolfi/voynich/misc/recipe-tk-hist.png
      
  Number of distinct words:
  
    The Bencao uses 1113 distinct words (syllables).
    
    The VMS recipes use 2760 distinct words, not counting 640 tokens 
    with unreadable/contentious characters ("*") in them.
    
  Word repeats:
  
    The Bencao contains 41 word repeats 
    (occurrences of ... X X ... consecutive pairs where X is any word).
    
    The most common words appearing in the Bencao repeats are

      8 pairs: big5 "¬~" = jis "?ô"
      6 pairs: big5 "¦å" = jis "??"
      5 pairs: big5 "´H" = jis "?¦"

    The VMS recipes contain 77 word repeats; extrapolating from 23 to
    27 pages, that would be 90 repeats.  The most common words in 
    those repeats are
    
     10 pairs: eva "qokeedy"
     10 pairs: eva "qokeey"
      7 pairs: eva "ar"
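
A sketch of how the word repeats could be counted, together with the
extrapolation from 23 to 27 pages. Two things here are assumptions of
the sketch, not statements about how the figures above were obtained:
a run "X X X" is counted as two consecutive pairs, and the token list
is treated as one continuous stream (no special handling of line or
recipe boundaries). The sample tokens are made up.

```python
from collections import Counter


def word_repeats(tokens):
    """Count consecutive pairs X X, returning (total, Counter by word)."""
    pairs = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    return sum(pairs.values()), pairs


sample = "qokeey qokeey ar ar ar daiin".split()
total, by_word = word_repeats(sample)
print(total, by_word.most_common())

# Scale the 77 repeats seen on 23 pages up to the presumed 27 pages.
extrapolated = round(77 * 27 / 23)
print(extrapolated)
```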

What do we make of these numbers?  Before drawing any conclusions,
we should note that:

  * The recipe boundaries are not always well-marked in the VMS.
    
    The four clues we could use are 
      (1) the last line is shorter, 
      (2) the first line starts or contains p/f gallows,
      (3) the last line ends with rarities such as "m", "g", "dl", etc. 
      (4) there is a star at the left of the first line. 
    
    On favorable pages, we find good correlation between these
    clues, especially (1), (2), and (4), except that the star 
    is often seen to be displaced vertically by as much as one line 
    from its correct position.  On some pages, on the other 
    hand, there are long blocks of text without clue (1),
    and the star positions often seem to be inconsistent with clues
    (2) and (3).  Sometimes even the number of stars seems to 
    be wrong.  For instance:
    
      on page f103v, star #8 seems to be too high (it implies a 1-line
      recipe followed by a 5.5-line one).
    
      on page f108v, stars #4, #7, #8, #10-#16 seem suspicious.

      on page f111r, there may be one star too many among #2--#11.

      on page f111v, there may be one star too many among stars #2--#10.
      Also, the scribe seems to have lost the sync in the bottom half.
      I would say that
      
        #13 is 1 line too high, 
        #14 is 2 lines too high, 
        #15 is one paragraph too high,
        #16, #17 are 2-3 lines too high, and
        #18 is 1/2 line too high.
    
      on page f113r, the scribe either missed a star after #8, or
      shifted stars #9-#15 down by one paragraph, and #16 down by 4.5 lines.
    
      on page f115r the scribe may have omitted a star
      after #5 and/or after #9.

    The Bencao recipe length histogram is fairly compact, with most
    recipes containing between 25 and 45 tokens. Mistakes in recipe
    boundaries would make the histogram wider, by creating an excess
    of shorter and longer recipes -- just as we see in the VMS
    histogram.

    Considering all these problems, the match between the two recipe
    length histograms actually seems surprisingly good. Note in
    particular that both histograms have roughly the same range, a
    sharp increase at 25 tokens, a linear decrease from 45 to 60
    tokens, and a small bump at >70 tokens. The latter is due to the
    following recipes:

       Voynich:
        74 tokens: recipe 064 page 105r lines 16-22
        71 tokens: recipe 079 page 105v lines 32-38
        72 tokens: recipe 247 page 113r lines 45-51
                             
       Bencao:               
        72 tokens: recipe 099 page 01-13 lines b09-b10 and a01-a02
        92 tokens: recipe 109 page 01-15 lines a05-a09
        72 tokens: recipe 171 page 02-07 lines a02-a05

    However, note that these extra-long recipes may be accidental
    joins of two normal-size ones (30-40 tokens), in which case it is
    pointless to try to match them.
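
    The effect of boundary mistakes on the length histogram can be
    illustrated with a toy sketch. The recipe lengths below are made
    up, and the two helper functions are hypothetical; they only model
    a star placed too high (tokens move from one recipe to the next)
    and an omitted star (two recipes are joined). In both cases the
    total token count is unchanged while the spread of lengths grows.

```python
from statistics import pstdev


def shift_boundary(lengths, i, k):
    """Star of recipe i+1 placed too high: k tokens move from recipe i to i+1."""
    out = list(lengths)
    out[i] -= k
    out[i + 1] += k
    return out


def drop_boundary(lengths, i):
    """Star of recipe i+1 omitted: recipes i and i+1 are joined."""
    return lengths[:i] + [lengths[i] + lengths[i + 1]] + lengths[i + 2:]


true_lengths = [30, 35, 40, 35, 30, 40]          # invented "compact" lengths
shifted = shift_boundary(true_lengths, 2, 20)    # one short, one long recipe
merged = drop_boundary(true_lengths, 0)          # one extra-long recipe

for name, lengths in [("true", true_lengths), ("shifted", shifted), ("merged", merged)]:
    print(name, lengths, round(pstdev(lengths), 2))
```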

  * The mapping between Chinese characters and VMS words may
    not be one-to-one.
    
    There are many factors that could cause discrepancies here. First,
    there is much uncertainty in the VMS transcription, especially on
    the pairs a/o, r/s, ch/ee, etc. There is even more uncertainty
    about word spaces: some tokens may have been incorrectly joined or
    split.
    
    Furthermore, the same Chinese character may be pronounced in
    different ways depending on the context: in particular, certain
    tones change into others, in a regular fashion, depending on the
    tone of the preceding syllable. The Chinese text does not register
    these changes, but a phonetic transcription would.
    
    Moreover, the author may have written the same Chinese sound in
    several ways.  This usually happens when one is taking dictation
    in an unfamiliar or unknown language.  In addition, the frequent m/g
    in line-final position suggest that the Voynichese text contains
    many abbreviations.  Finally, the use of p/f "capitals" in the Voynich
    script is expected to inflate the number of different words.
    
    All these factors would increase the apparent size of the VMS
    vocabulary. Note that if 15% of the Voynichese tokens were
    affected by those factors, that would suffice to explain the
    discrepancy observed in the vocabulary sizes (1113 distinct words
    in the Bencao, 2760 in the VMS). Therefore, this discrepancy,
    although substantial, still doesn't quite prove that they are
    different texts.
    
    For the same reasons, even if the VMS turns out to be a transcription
    of the Bencao, we should not expect to have a simple word-for-word
    match between the two texts.
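
The 15% figure mentioned above can be checked with a rough sketch. It
rests on a strong simplifying assumption that is mine, not something
established by the data: every affected token is taken to produce a
spelling that occurs nowhere else, so each one adds exactly one
distinct word. This is an upper bound on the inflation, since several
affected tokens could easily share the same variant spelling.

```python
bencao_words = 1113   # distinct words (syllables) in the Bencao
vms_tokens = 10540    # tokens in the extant VMS recipes
vms_words = 2760      # distinct readable words in the VMS recipes

# Assumption: each affected token adds one new distinct word.
affected_fraction = 0.15
extra_words = round(vms_tokens * affected_fraction)
predicted_vms_words = bencao_words + extra_words

# Fraction of tokens needed to account for the whole gap, in percent.
needed_percent = round(100 * (vms_words - bencao_words) / vms_tokens, 1)

print(extra_words, predicted_vms_words, needed_percent)
```

Under that assumption, 15% of the tokens brings the predicted
vocabulary close to the observed 2760 words.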
    
In conclusion, I would say that the jury is still out on this one...

All the best,

--stolfi
