Duplication in Chinese
Two or three weeks ago I posted a table showing that the ten most
frequent words in a sample of the biological section of the VMS often
occur in duplicated form (e.g. shedy.shedy), whereas the ten most
frequent words in a sample of Latin never occur as duplicates.
I have now taken a 4730-word sample of spoken Chinese from this site:
http://www.ocrat.com/ocrat/voa/Atxtpin.html
and made a similar analysis. It suggests that duplication does
occur in Chinese, but far less often than in the VMS biological
sample, even if you ignore Chinese tones: the ten most frequent
Chinese words give 3 doublets with tones and 11 without, against
43 doublets and 3 triplets for the ten most frequent VMS words.
The results are summarised in the attached file.
Philip Neal
I have taken a sample from this corpus of Voice of America broadcasts
http://www.ocrat.com/ocrat/voa/Atxtpin.html
containing 4730 words of Mandarin Chinese in pinyin transcription and
analysed the frequency of the 10 most frequent words, first including
and then ignoring the tone numbers. I also give the frequencies for
a 4089-word sample of the biological section of the VMS.
A = absolute frequency, B = occurrences as a doublet, C = occurrences as a triplet

Mandarin sample, tone numbers included
Word        A    B    C
de5       180    0    0
guo2      129    1    0
shi4       91    1    0
he2        60    0    0
zhong1     57    0    0
ji4        48    0    0
yi1        42    0    0
shuo1      42    1    0
bu4        41    0    0
zai4       40    0    0

Mandarin sample, tone numbers ignored
Word        A    B    C
de        187    2    0
shi       146    3    0
guo       144    1    0
yi        102    3    0
ji         99    1    0
zhi        86    1    0
wei        76    0    0
he         60    0    0
jin        53    0    0
li         49    0    0

VMS biological section sample
Word        A    B    C
shedy     155    6    0
chedy     141    4    0
qokeedy   135    9    0
qokain    127    2    0
qokedy    110    6    2
ol        109   11    1
qokal      82    2    0
qokaiin    72    1    0
qokeey     71    1    0
chey       61    1    0
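
For anyone who wants to repeat this kind of count on another text, here is a minimal Python sketch of one way it could be done. It is not the procedure actually used for the figures above, which the post does not spell out. It assumes the text has already been split into a list of words, that a "doublet" is a run of exactly two identical consecutive words and a "triplet" a run of exactly three, and that tone numbers are trailing digits on each pinyin syllable. The names strip_tone and repetition_table are made up for this illustration.

```python
import re
from collections import Counter
from itertools import groupby


def strip_tone(word):
    """Drop a trailing tone number, e.g. 'guo2' -> 'guo' (assumed convention)."""
    return re.sub(r"\d+$", "", word)


def repetition_table(words, top_n=10):
    """Return (word, A, B, C) rows for the top_n most frequent words.

    A = absolute frequency, B = runs of exactly two identical words,
    C = runs of exactly three. Longer runs are not counted in B or C
    (an assumption of this sketch).
    """
    freq = Counter(words)
    doublets = Counter()
    triplets = Counter()
    # groupby collapses each run of identical consecutive words into one group
    for word, run in groupby(words):
        length = sum(1 for _ in run)
        if length == 2:
            doublets[word] += 1
        elif length == 3:
            triplets[word] += 1
    return [(w, n, doublets[w], triplets[w]) for w, n in freq.most_common(top_n)]


if __name__ == "__main__":
    # Toy input only, not taken from either sample
    toy = "ol ol ol shedy shedy qokedy ol shedy".split()
    for row in repetition_table(toy):
        print(*row)
```

To ignore tones, strip them before counting, for example repetition_table([strip_tone(w) for w in words]). Counting runs rather than adjacent pairs means a triplet is not also counted as two doublets; if the opposite convention is wanted, the loop would need changing.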