
Duplication in Chinese



Two or three weeks ago I posted a table showing that the ten most
frequent words in a sample of the biological section of the VMS
frequently occur in duplicated form (e.g. shedy.shedy), whereas the
ten most frequent words in a sample of Latin never occur as duplicates.

I have now taken a 4730-word sample of spoken Chinese from this site
http://www.ocrat.com/ocrat/voa/Atxtpin.html
and made a similar analysis. It seems to show that duplication does
occur in Chinese, but not nearly as much as in the VMS biological
sample - even if you ignore Chinese tones. The results are summarised
in the attached file.

Philip Neal






I have taken a sample from this corpus of Voice of America broadcasts

http://www.ocrat.com/ocrat/voa/Atxtpin.html

containing 4730 words of Mandarin Chinese in pinyin transcription, and
analysed the frequency of the 10 most frequent words, first including
and then ignoring the tone numbers. I also give the frequencies for
a 4089-word sample of the biological section of the VMS.
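
For anyone who wants to repeat the count, the procedure described above can be sketched in Python. This is only an illustration, not the script used to produce the tables. It assumes the sample is already split into a list of words (pinyin syllables with a trailing tone digit, or VMS words separated by "." as in shedy.shedy), that a run of exactly two identical words counts as one doublet, and that a run of three or more counts as one triplet. The source does not say how runs were counted, and the names strip_tone and repetition_table are made up for this sketch.

```python
import re
from collections import Counter


def strip_tone(word):
    """Drop a trailing pinyin tone digit, e.g. 'guo2' -> 'guo'."""
    return re.sub(r"[1-5]$", "", word)


def repetition_table(words, top=10):
    """Return (word, frequency, doublets, triplets) for the most frequent words.

    Assumption: a run of exactly two identical adjacent words is one doublet;
    a run of three or more is one triplet.
    """
    freq = Counter(words)
    doublets = Counter()
    triplets = Counter()
    i = 0
    while i < len(words):
        # Extend j to the end of the run of identical words starting at i.
        j = i
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        run_length = j - i + 1
        if run_length == 2:
            doublets[words[i]] += 1
        elif run_length >= 3:
            triplets[words[i]] += 1
        i = j + 1
    return [(w, n, doublets[w], triplets[w]) for w, n in freq.most_common(top)]


# Usage: analyse a pinyin sample with tones, then again with tones ignored.
sample = "zhong1 guo2 shuo1 shuo1 de5 shi4 shi2 de5".split()
with_tones = repetition_table(sample)
without_tones = repetition_table([strip_tone(w) for w in sample])
```

Ignoring tones merges shi4 and shi2 into an adjacent pair shi shi, which is why the toneless count can show more doublets than the toned one.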

A = absolute frequency, B = occurrences as a doublet, C = occurrences as a triplet

Mandarin, tone numbers included:

Word	A	B	C

de5	180	0	0
guo2	129	1	0
shi4	91	1	0
he2	60	0	0
zhong1	57	0	0
ji4	48	0	0
yi1	42	0	0
shuo1	42	1	0
bu4	41	0	0
zai4	40	0	0

Mandarin, tone numbers ignored:

Word	A	B	C

de	187	2	0
shi	146	3	0
guo	144	1	0
yi	102	3	0
ji	99	1	0
zhi	86	1	0
wei	76	0	0
he	60	0	0
jin	53	0	0
li	49	0	0


VMS biological section:

Word	A	B	C


shedy	155	6	0
chedy	141	4	0
qokeedy	135	9	0
qokain	127	2	0
qokedy	110	6	2
ol	109	11	1
qokal	82	2	0
qokaiin	72	1	0
qokeey	71	1	0
chey	61	1	0
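
To put the three tables on a common footing, the figures can be reduced to a doublet rate. The sketch below simply sums the A, B and C columns of each table and expresses the doublets per thousand occurrences of the ten words. Normalising this way is my own choice for illustration and is not part of the original analysis; the samples differ in size (4730 and 4089 words), so the rates are only a rough guide. The name summarise is made up for this sketch.

```python
# Rows are (A, B, C) = (absolute frequency, doublets, triplets), copied
# from the three tables above in the order given there.
tables = {
    "Mandarin, with tones": [
        (180, 0, 0), (129, 1, 0), (91, 1, 0), (60, 0, 0), (57, 0, 0),
        (48, 0, 0), (42, 0, 0), (42, 1, 0), (41, 0, 0), (40, 0, 0),
    ],
    "Mandarin, tones ignored": [
        (187, 2, 0), (146, 3, 0), (144, 1, 0), (102, 3, 0), (99, 1, 0),
        (86, 1, 0), (76, 0, 0), (60, 0, 0), (53, 0, 0), (49, 0, 0),
    ],
    "VMS biological": [
        (155, 6, 0), (141, 4, 0), (135, 9, 0), (127, 2, 0), (110, 6, 2),
        (109, 11, 1), (82, 2, 0), (72, 1, 0), (71, 1, 0), (61, 1, 0),
    ],
}


def summarise(rows):
    """Return (total A, total B, total C, doublets per 1000 occurrences)."""
    total = sum(a for a, _, _ in rows)
    doublets = sum(b for _, b, _ in rows)
    triplets = sum(c for _, _, c in rows)
    return total, doublets, triplets, 1000 * doublets / total


for label, rows in tables.items():
    total, doublets, triplets, per_thousand = summarise(rows)
    print(f"{label}: {doublets} doublets and {triplets} triplets "
          f"in {total} occurrences ({per_thousand:.1f} per 1000)")
```

On these figures the doublet rate is roughly 4 per thousand for Mandarin with tones, 11 per thousand with tones ignored, and 40 per thousand for the VMS sample, and only the VMS sample has any triplets.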