Doubled words
> [Philip Neal:] If the current word is qokeey, there is a 6%
> chance that the next word will be qokeey. - [This] distribution
> is not characteristic of names, is very characteristic of all
> the high frequency Voynich words, and is strong evidence for
> Currier's view that the words are not words at all.
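The 6% figure quoted above is a conditional probability: of all the places where a given word occurs, how often is the very next token the same word? A minimal sketch of that measurement, assuming the text is already a flat list of tokens (the function name and the toy token stream are made up for illustration, not taken from any transcription):

```python
from collections import Counter

def self_repeat_rate(tokens, word):
    """Fraction of occurrences of `word` that are immediately followed by `word`.

    An occurrence at the very end of the text has no follower and is not counted.
    """
    followers = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == word)
    total = sum(followers.values())
    return followers[word] / total if total else 0.0

# Toy token stream, invented for illustration only.
toy = "qokeey qokeey dar qokeey chol qokeey qokeey shol".split()
print(self_repeat_rate(toy, "qokeey"))
```

Running the same function over every high-frequency word would show whether the behaviour is general or limited to a few words.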
The repetitions of "qokeey" are indeed exceptional, but they don't prove
the conclusion. After all, only a few VMS words behave like that.
Moreover, repetitive names *do* occur in some languages: "Sing Sing",
"Bora Bora", "Ping-Ping" (the name of a Chinese friend of mine), ...
The very first sample of Chinese in Pinyin that I found on the net had
"yi1 ba1 yi1 yi1 yi1", right in the first paragraph. There "yi1" meant
"one", and the translation was "... in 1811. One of ....".
> [Bob Richmond:] Another possibility is that we're looking at a
> language - and they're fairly common across the world - that
> forms plurals by doubling the singular form of the word.
Consider also that a herbal written in a language with subject-verb-object
structure could have many constructions of this sort:
Dioscorides had high regard for this HERB. The HERB grows ...
Now suppose that the language has no articles, and the text
is written without punctuation...
(For all I know, Vietnamese and Tibetan have no articles and mostly
S-V-O sentence structure. Chinese lacks articles too but its sentence
structure is mostly S-O-V. On the other hand Chinese has plenty of
doublets for other reasons. See below.)
I looked for doublets (consecutive word repeats, ignoring punctuation)
in some of my reference texts, see the table below. The columns are
  ndup    number of doublets in the text
  fdup    frequency of doublets relative to the number of tokens
  topwd   the most frequent word appearing in those doublets
  ntd     count of "topwd topwd" doublets
All texts were truncated so as to approximately match the VMS
non-label valid token count (35027). Note that some texts were much
longer than that -- the Vietnamese Bible sample, for instance, got
truncated at GEN:47:9; and the Greek sample consists of the first 45%
of each Gospel (Matthew, Mark, Luke and John).
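For concreteness, here is a minimal sketch of how the four columns could be computed. It is not the script actually used for the table; the function names are hypothetical, and it assumes that "punctuation" means any whitespace-separated token containing no letter or digit, and that truncation simply keeps the first 35027 word tokens.

```python
from collections import Counter

def words(text):
    """Split on whitespace and drop tokens with no letter or digit (assumed punctuation)."""
    return [t for t in text.split() if any(c.isalnum() for c in t)]

def doublet_stats(tokens, limit=35027):
    """Return (ndup, fdup, topwd, ntd) for the first `limit` tokens."""
    tokens = tokens[:limit]
    # A doublet is a pair of identical consecutive tokens.
    dups = Counter(a for a, b in zip(tokens, tokens[1:]) if a == b)
    ndup = sum(dups.values())
    fdup = ndup / len(tokens) if tokens else 0.0
    topwd, ntd = dups.most_common(1)[0] if dups else (None, 0)
    return ndup, fdup, topwd, ntd

sample = "shealy daiin chary chol chol dar otchar etaiin"
print(doublet_stats(words(sample)))
```

Ties for the most frequent doublet word are broken by first appearance here, which may differ from what another script would do.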
sample    language    book                     ndup  fdup    topwd       ntd
--------  ----------  -----------------------  ----  ------  ----------  ---
chin/red  Mandarin    Dream_of_Red_Mansion      351  .01002  lao3 (*)     44
voyn/tak  Voynichese  Takahashi's_sans_labels   316  .00835  chol ($)     22
chin/ptt  Mandarin    Union_Pentateuch          145  .00414  ge1 (ø)      44
tibe/ccv  Tibetan     Comm_Comm_Valid_Reason     90  .00257  MA (@)       54
grek/nwt  Greek       Byzantine_New_Testament    63  .00180  amën (#)     16
viet/ptt  Vietnamese  Cadman_Pentateuch          48  .00137  ddo+`i (§)    7
tibe/vim  Tibetan     Vimalakirti_Sutra          28  .00080  DE (&)       12
geez/gok  Ethiopian   Glory_of_the_Kings         17  .00048  'alElene      8
engl/wow  English     War_of_the_Worlds          16  .00046  had           3
span/qvi  Spanish     Don_Quijote_old_spellg     12  .00034  el (%)        3
engl/cul  English     Culpeper's_Herbal          11  .00031  it            4
latn/ptt  Latin       Vulgate_Pentateuch          9  .00026  septena       2
--------  ----------  -----------------------  ----  ------  ----------  ---
Notes:
($) Here are some "chol" doublets in voyn/tak:
f1r.P3.15;H chor shey kol chol chol kor chal sho
f8v.P.5;H shealy daiin chary chol chol dar otchar etaiin
f8v.P.8;H ry okchol ksh chol chol chol cthaiin dain
f8v.P.8;H okchol ksh chol chol chol cthaiin dain shol
f15v.P.9;H shol daiin otcholocthol chol chol chody kan sor
f93v.P.4;H shdchy qokchol qokchody chol chol cty ykchy dar
Here are the 10 most common doublet words in voyn/tak, if I can
believe my scripts:
count word
----- --------
22 chol
20 daiin
19 qokeedy
14 qokedy
12 qokeey
11 chedy
10 ar
9 ol
8 dy
8 shedy
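The two f8v.P.8 context lines above appear to be overlapping windows onto the same "chol chol chol" triple, which suggests that a run of three identical words is counted as two doublets. A small sketch, under that assumption, that finds maximal runs and converts them to a doublet count (the function name is hypothetical):

```python
from itertools import groupby

def repeat_runs(tokens):
    """Yield (word, run_length) for each maximal run of 2 or more identical tokens."""
    for word, group in groupby(tokens):
        n = sum(1 for _ in group)
        if n >= 2:
            yield word, n

line = "ry okchol ksh chol chol chol cthaiin dain".split()
runs = list(repeat_runs(line))
# A run of n identical words contains n - 1 consecutive pairs.
pairs = sum(n - 1 for _, n in runs)
print(runs, pairs)
```

Counting runs instead of pairs would lower the totals slightly for any word that occurs in triples.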
----------------------------------------------------------------------
(*) "lao3" is table-guessed pinyin for "姥" (GB-encoded in the sample).
Its first doublet in chin/red is shown below (bracketed):
因 狗 儿 白 日 间 又 作 些 生 计 ， 刘 氏 又 操 井 臼 等 事 ，
青 板 姊 弟 两 个 无 人 照 管 ， 狗 儿 遂
将 岳 母 刘 [ 姥 姥 ] 接 来 ， 一 处 过 活 。
The next most common doubled words are 太 "tai4" (32 doublets),
妹 "mei4" (21), 奶 "nai3" (17). (All pinyin readings are table guesses.)
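A practical note on GB-encoded samples like this one: when the bytes are displayed by software that assumes Latin-1, each Chinese character shows up as a pair of accented Latin letters. A minimal sketch of undoing that, assuming the damage is exactly "GB2312 bytes read as Latin-1" and nothing else (the function name is hypothetical):

```python
def fix_gb_mojibake(s):
    """Re-interpret text that is really GB2312 bytes displayed as Latin-1."""
    return s.encode("latin-1").decode("gb2312")

# "\u00c0\u00d1" is A-grave followed by N-tilde, i.e. the bytes C0 D1.
print(fix_gb_mojibake("\u00c0\u00d1"))
```

If a byte was dropped along the way, or a pair was turned into a character outside Latin-1, the call raises an error instead of guessing, so such spots have to be repaired by hand.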
----------------------------------------------------------------------
(ø) "ge1" is table-guessed pinyin for "哥" (GB-encoded in the sample).
Its first doublet in chin/ptt is shown below (bracketed):
# GEN:10:21
雅 弗 的 [ 哥 哥 ] 闪 ， 是 希 伯 子 孙 之 祖 ， 他 也 生 了 儿 子 。
#
# Unto Shem also, the father of all the children of Eber, the brother
# of Japheth the elder, even to him were children born.
The next most common doubled words are 我 "wo3" (10 doublets),
你 "ni3" (8), 他 "ta1" (8). (All pinyin readings are table guesses.)
----------------------------------------------------------------------
(@) Some of the 54 "MA" doublets in tibe/ccv:
BA'I PHYIR TSAD MA MA YIN NO ZHA
PA YANG TSAD MA MA YIN PAR 'GYUR
PA NI TSAD MA MA YIN TE SLU
GAL TE TSAD MA MA YIN NA , JI
BA NYID TSAD MA MA YIN TE DON
BA LA TSAD MA MA YIN NO ,, YANG
The next most common doubled words are
"SO" (10 doublets), "DE" (9), "RE" (5).
----------------------------------------------------------------------
(&) Some of the 12 "DE" doublets in tibe/vim:
TU ZHI MDZAD DE ,, DE NI RGYAL BA'I
TU SEMS BSKYED DE , DE GNYIS KYIS SKYES
LTAR STON PA DE DE BZHIN TE , 'ON
KYI SGO YOD DE , DE LA NAN TAN
The next most common doubled words are "MA" (3 doublets),
"SO" (3) and "GLANG" (2).
----------------------------------------------------------------------
(§) Here are some doublets from viet/ptt ("dd" = crossed-"d";
diacritics apply to previous letter -- "+" = horn, "(" = brevis,
"." = dot-below, "~" = tilde, "?" = curl; rest should be obvious)
GEN:03:22 va` ddu+o+.c so^'ng ddo+`i ddo+`i cha(ng . gie^ ho^
GEN:07:18 no^?i tre^n ma(.t nu+o+'c . nu+o+'c ca`ng du+ng le^n
GEN:08:03 kho?i ma(.t dda^'t , la^`n la^`n vu+`a ha. vu+`a
GEN:08:05 ra't .= nu+o+'c cu+' la^`n la^`n ha. cho dde^'n
GEN:09:12 qua ca'c ddo+`i ma~i ma~i . ta dda(.t mo^'ng
GEN:09:16 su+. giao u+o+'c ddo+`i ddo+`i cu?a ddu+'c chu'a
GEN:10:11 ro^`i la^.p tha`nh ni ni ve , re^ ho^
GEN:10:12 giu+~a khoa?ng tha`nh ni ni ve va` ca
GEN:11:29 na co^ cu+o+'i vo+. ; vo+. a'p ram te^n
GEN:12:09 vu+`a ddo'ng tra.i la^`n la^`n dde^'n nam phu+o+ng
GEN:12:19 nha^.n la^'y va` ddi ddi . ddoa.n , pha ra
GEN:13:15 do`ng do~i ngu+o+i ddo+`i ddo+`i . ta se~ la`m
GEN:14:01 vua si ne^ a ; a ri o'c , vua
GEN:14:24 re^ ; ve^` pha^`n ho. , ho. ha~y la^'y pha^`n
GEN:16:12 ngu+o+`i ddi.ch la.i no' . no' se~ o+? ve^`
GEN:17:07 la` giao u+o+'c ddo+`i ddo+`i , ha^`u cho ta
GEN:17:08 la`m co+ nghie^.p ddo+`i ddo+`i . va^.y , ta se~
GEN:17:13 ta se~ la^.p ddo+`i ddo+`i trong xa'c thi.t
GEN:17:16 ban phu+o+'c cho na`ng , na`ng se~ la`m me.
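Several of the Vietnamese lines above are doublets only because punctuation is ignored (for example "nu+o+'c . nu+o+'c" in GEN:07:18). The sketch below compares the two ways of counting; it assumes that a punctuation token is any token with no letter in it, which may not match the rule used for the table, and the function names are hypothetical.

```python
def is_word(token):
    """Assumed rule: a token is a word if it contains at least one letter."""
    return any(c.isalpha() for c in token)

def count_doublets(tokens, ignore_punct=True):
    """Count consecutive repeats of word tokens, optionally skipping punctuation first."""
    if ignore_punct:
        tokens = [t for t in tokens if is_word(t)]
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == b and is_word(a))

across = "no^?i tre^n ma(.t nu+o+'c . nu+o+'c ca`ng du+ng le^n".split()
within = "su+. giao u+o+'c ddo+`i ddo+`i cu?a ddu+'c chu'a".split()
print(count_doublets(across), count_doublets(across, ignore_punct=False))
print(count_doublets(within), count_doublets(within, ignore_punct=False))
```

Repeats that straddle a sentence boundary are closer to the "HERB. The HERB" situation than to true reduplication, so reporting both counts may be worthwhile.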
The most popular doubled words after "ddo+`i" are
"mau" (4 doublets), "la^`n" (3), "na`ng" (3), "ngu+o+i" (3).
----------------------------------------------------------------------
(#) Here are some grek/nwt doublets ("ë" = eta, "ô" = omega, "ð" = theta)
MAT 01:01 uiou dauid uiou abraam abraam egennësen ton isaak
MAT 01:02 abraam egennësen ton isaak isaak de egennësen ton
MAT 01:02 de egennësen ton iakôb iakôb de egennësen ton
MAT 01:03 de egennësen ton esrôm esrôm de egennësen ton
MAT 01:03 de egennësen ton aram aram de egennësen ton
LUK 06:46 de me kaleite kurie kurie kai ou poieite
LUK 06:47 umin tini estin omoios omoios estin anðrôpô oikodomounti
LUK 07:31 kai tini eisin omoioi omoioi eisin paidiois tois
JHN 01:51 kai legei autô amën amën legô umin ap
JHN 03:03 kai eipen autô amën amën legô soi ean
JHN 03:05 gennëðënai apekriðë iësous amën amën legô soi ean
JHN 03:11 tauta ou ginôskeis amën amën legô soi oti
All the 16 "amën" doublets are in John. The next most common
doubled words after "amën" are "kurie" (3 doublets) and "iakôb" (2).
Beware that some of these doublets may be text processing errors.
----------------------------------------------------------------------
(%) In span/qvi, "el" is both article ("the") and oblique pronoun ("him").
A sample "el" doublet is
y puso en EL EL hierro que quitó
and [he] put on HIM THE shackles which he took
----------------------------------------------------------------------
Note the difference between the two Chinese samples: the classic novel
"Dream of the Red Mansion" (~1750) has more than twice as many
doublets as the Union Pentateuch. The difference may be due to subject
matter, of course: for one thing, duplication seems to be relatively
common in Chinese personal names, which do not occur in the Bible.
Another possible explanation is that the Bible was presumably
translated by Western missionaries, who may have had an unconscious
bias against repetition (generally deprecated in Western literary
standards).
In either case, it is unfortunate that the only Vietnamese sample I
have is a translation of the Bible. I am still looking for a better
Vietnamese electronic text (native author, prose, not too many errors,
at least 35000 words). If you know of such a thing, please tell me...
The Tibetan samples may have similar problems: tibe/vim (the "Sutra of
Vimalakirti") is an ancient translation from a Sanskrit original (ca.
500 BCE), and the same may be true of tibe/ccv ("A Commentary on a
Commentary on the Sutra of Valid Reasoning", ca. 1700).
I have yet to find any usable sample of Burmese, which is another
major member of the same family and a possible candidate under the
"Chinese Theory". (The Portuguese had already reached Burma/Myanmar
and Vietnam by 1520. Unfortunately that is all I could find about
those contacts.)
An herbal treatise in any of those languages would be most useful too.
----------------------------------------------------------------------