Doubled words

To: voynich@xxxxxxxx
Subject: Doubled words
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
Date: Wed, 13 Feb 2002 20:34:51 -0200 (EDT)
In-reply-to: <cf.127d68a2.2995336a@aol.com>
References: <cf.127d68a2.2995336a@aol.com>
Reply-to: stolfi@xxxxxxxxxxxxx
    > [Philip Neal:] If the current word is qokeey, there is a 6%
    > chance that the next word will be qokeey. - [This] distribution
    > is not characteristic of names, is very characteristic of all
    > the high frequency Voynich words, and is strong evidence for
    > Currier's view that the words are not words at all.

The repetitions of "qokeey" are indeed exceptional, but they don't prove
the conclusion.  After all, only a few VMS words behave like that.
Moreover, repetitive names *do* occur in some languages: "Sing Sing",
"Bora Bora", "Ping-Ping" (the name of a Chinese friend of mine), ...

The very first sample of Chinese in Pinyin that I found on the net had
"yi1 ba1 yi1 yi1 yi1", right in the first paragraph.  There "yi1" meant
"one", and the translation was  "... in 1811. One of ....".

    > [Bob Richmond:] Another possibility is that we're looking at a
    > language - and they're fairly common across the world - that
    > forms plurals by doubling the singular form of the word.
   
Consider also that a herbal written in a language with subject-verb-object
structure could have many constructions of this sort:

  Dioscorides had high regard for this HERB.  The HERB grows ...

Now suppose that the language has no articles, and the text
is written without punctuation... 
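
A minimal sketch of that thought experiment, using English as a stand-in: it drops punctuation and articles from the sample sentence and reports adjacent repeats. The article list and the helper names are assumptions made for illustration only.

```python
import re

def strip_articles_and_punctuation(text, articles=("the", "a", "an")):
    # Keep only runs of letters, so punctuation disappears,
    # then drop the (assumed) list of articles.
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in articles]

def doublets(words):
    # (position, word) for every word equal to its successor, ignoring case.
    return [(i, w) for i, (w, nxt) in enumerate(zip(words, words[1:]))
            if w.lower() == nxt.lower()]

sample = "Dioscorides had high regard for this HERB. The HERB grows"
words = strip_articles_and_punctuation(sample)
print(words)
print(doublets(words))
```

With the article and the full stop gone, the two occurrences of HERB become adjacent and are reported as a doublet.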

(For all I know, Vietnamese and Tibetan have no articles and mostly
S-V-O sentence structure. Chinese lacks articles too but its sentence
structure is mostly S-O-V. On the other hand Chinese has plenty of
doublets for other reasons. See below.)

I looked for doublets (consecutive word repeats, ignoring punctuation)
in some of my reference texts; see the table below.  The columns are:

  ndup   number of doublets in the text
  fdup   frequency of doublets relative to the number of tokens
  topwd  the most frequent word appearing in those doublets
  ntd    count of "topwd topwd" doublets

  All texts were truncated so as to approximately match the VMS
  non-label valid token count (35027). Note that some texts were much
  longer than that -- the Vietnamese Bible sample, for instance, got
  truncated at GEN:47:9; and the Greek sample consists of the first 45%
  of each Gospel (Matthew, Mark, Luke and John).
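
The four columns could be computed roughly as in the sketch below. This is not the script actually used for the table; the punctuation set, the function names and the tie-breaking for topwd are assumptions. Overlapping repeats are counted pairwise, so "chol chol chol" gives two doublets.

```python
from collections import Counter

# Assumed punctuation characters; a token made only of these is skipped.
PUNCT_CHARS = set(".,;:!?=")

def tokens(text, limit=None):
    """Whitespace-split words, dropping punctuation-only tokens,
    optionally truncated to the first `limit` words."""
    words = [t for t in text.split()
             if not all(c in PUNCT_CHARS for c in t)]
    return words if limit is None else words[:limit]

def doublet_stats(words):
    """Return (ndup, fdup, topwd, ntd) for a list of words."""
    dups = Counter(a for a, b in zip(words, words[1:]) if a == b)
    ndup = sum(dups.values())
    fdup = ndup / len(words) if words else 0.0
    topwd, ntd = dups.most_common(1)[0] if dups else (None, 0)
    return ndup, fdup, topwd, ntd

# Two single lines taken from the samples quoted in the notes.
voyn = doublet_stats(tokens("ry okchol ksh chol chol chol cthaiin dain"))
viet = doublet_stats(tokens("ngu+o+`i ddi.ch la.i no' . no' se~ o+? ve^`",
                            limit=35027))
print(voyn)
print(viet)
```

Note that a word such as "ddi.ch" survives the filter because only tokens consisting entirely of punctuation characters are dropped.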

  sample   language   book                    ndup   fdup  topwd      ntd
  -------- ---------- ----------------------- ---- ------  ---------- ---
  chin/red Mandarin   Dream_of_Red_Mansion     351 .01002  lao3   (*)  44
  voyn/tak Voynichese Takahashi's_sans_labels  316 .00835  chol   ($)  22
  chin/ptt Mandarin   Union_Pentateuch         145 .00414  ge1    (ø)  44
  tibe/ccv Tibetan    Comm_Comm_Valid_Reason    90 .00257  MA     (@)  54   
  grek/nwt Greek      Byzantine_New_Testament   63 .00180  amën   (#)  16
  viet/ptt Vietnamese Cadman_Pentateuch         48 .00137  ddo+`i (§)   7
  tibe/vim Tibetan    Vimalakirti_Sutra         28 .00080  DE     (&)  12
  geez/gok Ethiopian  Glory_of_the_Kings        17 .00048  'alElene     8 
  engl/wow English    War_of_the_Worlds         16 .00046  had          3
  span/qvi Spanish    Don_Quijote_old_spellg    12 .00034  el     (%)   3
  engl/cul English    Culpeper's_Herbal         11 .00031  it           4
  latn/ptt Latin      Vulgate_Pentateuch         9 .00026  septena      2
  -------- ---------- ----------------------- ---- ------  ---------- ---

  Notes:
  
  ($) Here are some "chol" doublets in voyn/tak:
  
    f1r.P3.15;H             chor shey kol chol chol kor chal sho
    f8v.P.5;H          shealy daiin chary chol chol dar otchar etaiin
    f8v.P.8;H               ry okchol ksh chol chol chol cthaiin dain
    f8v.P.8;H             okchol ksh chol chol chol cthaiin dain shol
    f15v.P.9;H    shol daiin otcholocthol chol chol chody kan sor
    f93v.P.4;H    shdchy qokchol qokchody chol chol cty ykchy dar
  
  Here are the 10 most common doublet words in voyn/tak, if I can
  believe my scripts:

    count word
    ----- --------
       22 chol
       20 daiin
       19 qokeedy
       14 qokedy
       12 qokeey
       11 chedy
       10 ar
        9 ol
        8 dy
        8 shedy
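
The counts above are raw doublet counts. A figure like the 6% quoted for "qokeey" is instead a per-word rate: the number of "w w" pairs divided by the number of occurrences of w that have a following word. A minimal sketch, assuming the text is already a flat list of words; the function name, the min_count cutoff and the toy word sequence are made up for illustration.

```python
from collections import Counter

def self_repeat_rates(words, min_count=2):
    """Estimate, for each word w, how often w is followed by w again."""
    followed = Counter(words[:-1])   # occurrences that have a successor
    repeated = Counter(a for a, b in zip(words, words[1:]) if a == b)
    return {w: repeated[w] / n
            for w, n in followed.items() if n >= min_count}

# Toy sequence, not a real transcription line.
words = "chol chol chol cthaiin dain shol chol dar".split()
rates = self_repeat_rates(words)
print(rates)
```

Here "chol" has four occurrences with a successor, two of which are followed by "chol" again, so its rate is 0.5; words seen fewer than min_count times are left out.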

  ----------------------------------------------------------------------
  (*) "lao3" is table-guessed pinyin for "姥" (GB code C0D1).
  Its first doublet in chin/red is shown below (bracketed):

    因 狗 儿 白 日 间 又 作 些 生 计 ， 刘 氏 又 操 井 臼 等 事 ，
    青 板 姊 弟 两 个 无 人 照 管 ， 狗 儿 遂
    将 岳 母 刘 [ 姥 姥 ] 接 来 ， 一 处 过 活 。

  The next most common doubled words are 太 "tai4" (32 doublets),
  妹 "mei4" (21), 奶 "nai3" (17). (All pinyin readings are table guesses.)

  ----------------------------------------------------------------------
  (ø) "ge1" is table-guessed pinyin for "哥" (GB code B8E7).
  Its first doublet in chin/ptt is shown below (bracketed):

    # GEN:10:21
      雅 弗 的 [ 哥 哥 ] 闪 ， 是 希 伯 子 孙 之 祖 ， 他 也 生 了 儿 子 。
    #
    # Unto Shem also, the father of all the children of Eber, the brother
    # of Japheth the elder, even to him were children born.

  The next most common doubled words are 我 "wo3" (10 doublets),
  你 "ni3" (8), 他 "ta1" (8).  (All pinyin readings are table guesses.)
  
  ----------------------------------------------------------------------
  (@) Some of the 54 "MA" doublets in tibe/ccv:

    BA'I PHYIR TSAD MA MA YIN NO ZHA
       PA YANG TSAD MA MA YIN PAR 'GYUR
         PA NI TSAD MA MA YIN TE SLU
        GAL TE TSAD MA MA YIN NA , JI
       BA NYID TSAD MA MA YIN TE DON
         BA LA TSAD MA MA YIN NO ,, YANG
         
  The next most common doubled words are  
  "SO" (10 doublets), "DE" (9), "RE" (5).

  ----------------------------------------------------------------------
  (&) Some of the 12 "DE" doublets in tibe/vim:

      TU ZHI MDZAD DE ,, DE NI RGYAL BA'I
    TU SEMS BSKYED DE ,  DE GNYIS KYIS SKYES
      LTAR STON PA DE    DE BZHIN TE , 'ON
       KYI SGO YOD DE ,  DE LA NAN TAN
       
  The next most common doubled words are "MA" (3 doublets),
  "SO" (3) and "GLANG" (2).

  ----------------------------------------------------------------------
  (§) Here are some doublets from viet/ptt ("dd" = crossed-"d"; 
  diacritics apply to previous letter -- "+" = horn, "(" = brevis,
  "." = dot-below, "~" = tilde, "?" = curl; rest should be obvious)
  
    GEN:03:22  va` ddu+o+.c so^'ng ddo+`i   ddo+`i cha(ng . gie^ ho^
    GEN:07:18   no^?i tre^n ma(.t nu+o+'c . nu+o+'c ca`ng du+ng le^n
    GEN:08:03  kho?i ma(.t dda^'t , la^`n   la^`n vu+`a ha. vu+`a
    GEN:08:05  ra't .= nu+o+'c cu+' la^`n   la^`n ha. cho dde^'n
    GEN:09:12        qua ca'c ddo+`i ma~i   ma~i . ta dda(.t mo^'ng
    GEN:09:16     su+. giao u+o+'c ddo+`i   ddo+`i cu?a ddu+'c chu'a
    GEN:10:11       ro^`i la^.p tha`nh ni   ni ve , re^ ho^
    GEN:10:12    giu+~a khoa?ng tha`nh ni   ni ve va` ca
    GEN:11:29         na co^ cu+o+'i vo+. ; vo+. a'p ram te^n
    GEN:12:09    vu+`a ddo'ng tra.i la^`n   la^`n dde^'n nam phu+o+ng
    GEN:12:19        nha^.n la^'y va` ddi   ddi . ddoa.n , pha ra
    GEN:13:15   do`ng do~i ngu+o+i ddo+`i   ddo+`i . ta se~ la`m
    GEN:14:01                vua si ne^ a ; a ri o'c , vua
    GEN:14:24       re^ ; ve^` pha^`n ho. , ho. ha~y la^'y pha^`n
    GEN:16:12    ngu+o+`i ddi.ch la.i no' . no' se~ o+? ve^`
    GEN:17:07      la` giao u+o+'c ddo+`i   ddo+`i , ha^`u cho ta
    GEN:17:08    la`m co+ nghie^.p ddo+`i   ddo+`i . va^.y , ta se~
    GEN:17:13         ta se~ la^.p ddo+`i   ddo+`i trong xa'c thi.t
    GEN:17:16      ban phu+o+'c cho na`ng , na`ng se~ la`m me.
    
  The most popular doubled words after "ddo+`i" are
  "mau" (4 doublets), "la^`n" (3), "na`ng" (3), "ngu+o+i" (3). 

  ----------------------------------------------------------------------
  (#) Here are some grek/nwt doublets ("ë" = eta, "ô" = omega, "ð" = theta)

      MAT 01:01          uiou dauid uiou abraam abraam egennësen ton isaak
      MAT 01:02      abraam egennësen ton isaak isaak de egennësen ton
      MAT 01:02          de egennësen ton iakôb iakôb de egennësen ton
      MAT 01:03          de egennësen ton esrôm esrôm de egennësen ton
      MAT 01:03           de egennësen ton aram aram de egennësen ton

      LUK 06:46             de me kaleite kurie kurie kai ou poieite
      LUK 06:47          umin tini estin omoios omoios estin anðrôpô oikodomounti
      LUK 07:31           kai tini eisin omoioi omoioi eisin paidiois tois

      JHN 01:51             kai legei autô amën amën legô umin ap
      JHN 03:03             kai eipen autô amën amën legô soi ean
      JHN 03:05 gennëðënai apekriðë iësous amën amën legô soi ean
      JHN 03:11         tauta ou ginôskeis amën amën legô soi oti

    All 16 "amën" doublets are in John. The next most common
    doubled words after "amën" are "kurie" (3 doublets) and "iakôb" (2).
    Beware that some of these doublets may be text-processing errors.

  ----------------------------------------------------------------------
  (%) In span/qvi, "el" is both article ("the") and oblique pronoun ("him").
  A sample "el" doublet is 

    y puso en EL EL hierro que quitó
    and [he] put on HIM THE shackles which he took

  ----------------------------------------------------------------------

Note the difference between the two Chinese samples: the classic novel
"Dream of the Red Mansion" (~1750) has more than twice as many
doublets as the Union Pentateuch. The difference may be due to subject
matter, of course: for one thing, duplication seems to be relatively
common in Chinese personal names, which do not occur in the Bible.

Another possible explanation is that the Bible was presumably
translated by Western missionaries, who may have had an unconscious
bias against repetition (generally deprecated by Western literary
standards).

In either case, it is unfortunate that the only Vietnamese sample I
have is a translation of the Bible. I am still looking for a better
Vietnamese electronic text (native author, prose, not too many errors,
at least 35000 words). If you know of such a thing, please tell me...

The Tibetan samples may have similar problems: tibe/vim (the "Sutra of
Vimalakirti") is an ancient translation from a Sanskrit original (ca.
500 BCE), and the same may be true of tibe/ccv ("A Commentary on a
Commentary on the Sutra of Valid Reasoning", ca. 1700).

I have yet to find any usable sample of Burmese, which is another
major member of the same family and a possible candidate under the
"Chinese Theory". (The Portuguese had already reached Burma/Myanmar
and Vietnam by 1520. Unfortunately that is all I could find about
those contacts.)

An herbal treatise in any of those languages would be most useful too.
