Re: Doubled words
> [stolfi:] The columns are
>
> ndup number of doublets in the text
> fdup frequency of doublets relative to num of tokens
> topwd the most frequent word appearing in those doublets
> ntd count of "topwd topwd" doublets
>
> [Rene:] are ndup and fdup based on the sum over all words, or
> for the most commonly reduplicated word only?
The former:
"ndup" is the number of occurrences of the pattern "X P X" in the
text, where X is any valid word and P is either empty or punctuation.
"fdup" is "ndup" divided by the number of tokens in the
text (about 35000 for all samples).
"topwd" is the most common value of X among those "ndup" doublets.
"ntd" is the number of doublets "X P X" where X = topwd.
> So: what would fdup for the most commonly reduplicated
> word be?
Approximately 44/35000 = 0.00125. In other words, about once
every 800 tokens one finds an occurrence of "ÀÑ ÀÑ"
in the Red Mansion novel, ignoring punctuation.
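As a rough illustration of the four columns defined above, here is a minimal Python sketch. It is not the script used for the table. The function name, the punctuation set, and the choices to skip any run of punctuation, to count "X X X" as two doublets, and to count only words (not punctuation) as tokens are all assumptions made for the example.

```python
from collections import Counter

# Assumed punctuation set; the real texts would need their own.
PUNCT = frozenset({",", ".", ":", ";", "!", "?"})

def doublet_stats(tokens, punct=PUNCT):
    """Return ndup, fdup, topwd, ntd for a list of tokens.

    A doublet is "X P X" with P empty or punctuation. Assumptions:
    punctuation is skipped entirely, overlapping doublets all count,
    and the token total counts words only.
    """
    doublets = Counter()
    ntokens = 0
    prev_word = None
    for tok in tokens:
        if tok in punct:
            continue              # P: ignore punctuation between words
        ntokens += 1
        if tok == prev_word:
            doublets[tok] += 1    # one more "X P X" occurrence
        prev_word = tok
    ndup = sum(doublets.values())
    fdup = ndup / ntokens if ntokens else 0.0
    topwd, ntd = doublets.most_common(1)[0] if doublets else (None, 0)
    return {"ndup": ndup, "fdup": fdup, "topwd": topwd, "ntd": ntd}

stats = doublet_stats("a a , a b b c".split())
```

With this definition, the figure quoted above is simply ntd divided by the token count, i.e. 44/35000.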
Presumably what you would like to know is, for each word X, the
probability "pdup(X)" of an occurrence of X being followed by a second
one --- which is 6% for "qokeey", according to Philip. That would be
the count of "X P X" doublets (22 for X = "chol", 20 for "daiin", 19
for "qokeedy" etc.) divided by the total occurrences of "X".
Note that the word with highest pdup(X) is not necessarily the "topwd"
listed in my table. I would like to know that too, but it will take
some more hacking. Maybe this weekend...
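A sketch of how pdup(X) could be computed, assuming the text is already split into tokens and that punctuation is simply dropped before pairing. The function name, the punctuation set, and the "min_count" cutoff are hypothetical, not part of the actual tooling.

```python
from collections import Counter

def pdup_table(tokens, punct=frozenset(",.:;!?"), min_count=1):
    """Map each doubled word X to pdup(X) = doublets of X / occurrences of X.

    Assumption: punctuation tokens are removed first, so "X , X"
    counts the same as "X X". Words seen fewer than min_count
    times are left out.
    """
    words = [t for t in tokens if t not in punct]
    total = Counter(words)
    # Count adjacent equal pairs; "X X X" yields two doublets.
    dup = Counter(a for a, b in zip(words, words[1:]) if a == b)
    return {w: dup[w] / total[w] for w in dup if total[w] >= min_count}

table = pdup_table("chol chol daiin chol , chol daiin".split())
best = max(table, key=table.get)   # word with the highest pdup(X)
```

A cutoff such as "min_count" is probably needed in practice, since a word that occurs only twice, both times as a doublet, would otherwise get pdup = 0.5 and dominate the ranking.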
> what is table-guessed Pinyin?
The only usable Chinese texts that I could find were in Chinese
characters (ideograms), not in Pinyin. There are many different
ideograms with the same pronunciation (think of English "too", "2",
"to"). Conversely, there are many ideograms which have more than one
Mandarin pronunciation (think of "read" and "record" and the letter
"z"). The situation is not as bad as in Japanese, where multiple
sounds are the rule, but definitely worse than in English.
The data I have about pronunciation is basically the following table,
created by merging a handful of similar tables found on the net:
# GB OUTPUT   # HEX  UNIC Frq.Ptt Frq.Red Frq.WWW Definition
# -- -------- - ---- ---- ------- ------- ------- ---------------------
  £¬ «_,_»    # A3AC FF0C   14442   58032       0 FULLWIDTH_COMMA
  ¡£ «.»      # A1A3 3002    6715   25450       0 IDEOGRAPHIC_FULL_STOP
  ÁË liao3    # C1CB 4E86    1661   20799  305165 {liao3,le5,liao4}
  µÄ de5      # B5C4 7684    9060   15124  581720 {de5,di2,di4}
  ²» bu4      # B2BB 4E0D    1614   14590  311442 {bu4,bu2}
  £º «_:_»    # A3BA FF1A       0   12782       0 FULLWIDTH_COLON
  Ò» yi1      # D2BB 4E00    1722   11745  363590 {yi1}
  À´ lai2     # C0B4 6765     983   11192  154591 {lai2}
  µÀ dao4     # B5C0 9053     176   10979  143820 {dao4}
  ÈË ren2     # C8CB 4EBA    2399   10276  212609 {ren2}
The first column is the ideogram code in the Mainland (Guo Biao, GB)
encoding, consisting of two bytes in the 128-255 range. The HEX column
gives those two bytes in hexadecimal, and the UNIC column
is the corresponding Unicode value. The "Frq" columns are the counts
of that ideogram in the Union Pentateuch, in the Dream of a Red
Mansion, and in a modern usage table found on the net. The last
column tells whether it is a "special character" (punctuation, graphic
symbol, etc.) and, if not, which sounds are given for it by one or
more of the input tables.
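For readers who want to check the relation between the HEX and UNIC columns, here is a small Python sketch. The helper name is made up for the example, and it assumes that Python's standard "gb2312" codec matches the GB encoding used in the table.

```python
def gb_hex_to_unicode(gb_hex: str) -> str:
    """Convert a two-byte GB code in hex (HEX column) to the
    hexadecimal Unicode code point (UNIC column)."""
    char = bytes.fromhex(gb_hex).decode("gb2312")  # two GB bytes -> one character
    return f"{ord(char):04X}"

# Rows taken from the table: C1CB should map to 4E86, B5C4 to 7684.
codes = [gb_hex_to_unicode(h) for h in ("C1CB", "B5C4")]
```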
Presumably the correct sound depends on context and/or epoch and/or
subdialect. Some tones, in particular, are pronounced differently
depending on the tones of adjacent syllables; and some of that
variation may have found its way into the tables. (Tone "5" appears to be
vestigial/controversial, or may indicate "no definite tone" --- or
possibly both, depending on the input table.)
There are commercial tools out there which claim to convert ideograms
to usable pinyin, taking context into account; but I don't have access
to them. The second column is my best guess, what I called
"table-guessed pinyin": out of the Mandarin sounds listed in the last
column, I picked one by taking a weighted consensus of the various
input tables, breaking ties arbitrarily.
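The weighted consensus could look roughly like the Python sketch below. This is only an illustration of the idea: the function name, the weights, and the contents of the two toy input tables are invented for the example, and the tie-breaking rule (first reading encountered wins) is just one arbitrary choice.

```python
from collections import defaultdict

def guess_reading(char, tables):
    """Pick one reading for char by weighted vote.

    tables is a list of (weight, mapping) pairs, where each mapping
    sends a character to the list of readings that table gives for it.
    Ties are broken arbitrarily (first reading seen with the top score).
    """
    score = defaultdict(float)
    for weight, table in tables:
        for reading in table.get(char, []):
            score[reading] += weight   # each table votes with its weight
    if not score:
        return None                    # character not in any table
    return max(score, key=score.get)

# Hypothetical input tables and weights, for illustration only.
LE = "\u4e86"   # the ideogram with Unicode value 4E86
tables = [
    (2.0, {LE: ["liao3", "le5"]}),
    (1.0, {LE: ["le5", "liao4"]}),
]
guess = guess_reading(LE, tables)
```

A scheme like this knows nothing about context, so it always returns the same reading for a given ideogram.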
It seems that my choice was wrong more often than not: for instance,
I suspect that for "ÁË" = {liao3,le5,liao4}, the most common word in the
Red Mansion novel, the best guess should have been "le5" (a very
popular but hard to translate particle), not "liao3".
All the best,
--stolfi