
Re: Doubled words




    > [stolfi:] The columns are
    > 
    >   ndup   number of doublets in the text
    >   fdup   frequency of doublets relative to num of tokens
    >   topwd  the most frequent word appearing in those doublets
    >   ntd    count of "topwd topwd" doublets
    > 
    > [Rene:] are ndup and fdup based on the sum over all words, or
    > for the most commonly reduplicated word only?

The former:

  "ndup" is the number of occurrences of the pattern "X P X" in the
  text, where X is any valid word and P is either empty or punctuation.

  "fdup" is "ndup" divided by the number of tokens in the 
  text (about 35000 for all samples).

  "topwd" is the most common value of X among those "ndup" doublets.

  "ntd" is the number of doublets "X P X" where X = topwd.
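
As a concrete reading of these four definitions, here is a minimal
Python sketch. It is not the script used for the table. It assumes the
text is already split into tokens, that punctuation shows up as
separate tokens, that only word tokens count toward the total, and
that "X X X" counts as two doublets. The name doublet_stats is made up
for this example.

```python
from collections import Counter
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True if the token is non-empty and made only of punctuation."""
    return bool(token) and all(c in PUNCT for c in token)

def doublet_stats(tokens):
    """Return (ndup, fdup, topwd, ntd) for a list of tokens."""
    # Dropping punctuation tokens makes "X P X" look like "X X".
    words = [t for t in tokens if not is_punct(t)]
    # Count each adjacent pair of identical words, keyed by the word.
    dup = Counter(a for a, b in zip(words, words[1:]) if a == b)
    ndup = sum(dup.values())
    fdup = ndup / len(words) if words else 0.0
    topwd, ntd = dup.most_common(1)[0] if dup else (None, 0)
    return ndup, fdup, topwd, ntd

sample = "daiin daiin chol , chol chol shedy".split()
print(doublet_stats(sample))
```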

    > So: what would fdup for the most commonly reduplicated
    > word be?

Approximately 44/35000 = 0.00126.  In other words, about once
every 800 tokens one finds an occurrence of "姥 姥"
in the Red Mansion novel, ignoring punctuation.

Presumably what you would like to know is, for each word X, the
probability "pdup(X)" of an occurrence of X being followed by a second
one --- which is 6% for "qokeey", according to Philip. That would be
the count of "X P X" doublets (22 for X = "chol", 20 for "daiin", 19
for "qokeedy" etc.) divided by the total occurrences of "X".
Note that the word with highest pdup(X) is not necessarily the "topwd"
listed in my table. I would like to know that too, but it will take
some more hacking. Maybe this weekend...
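
For what it is worth, the computation of pdup(X) described above could
look like the following Python sketch. It assumes a list of words with
punctuation already removed. The function name pdup_table and the
min_count cutoff (meant to keep very rare words from dominating the
ranking) are inventions for this example.

```python
from collections import Counter

def pdup_table(words, min_count=1):
    """Map each word X to pdup(X) = (count of "X X") / (count of X)."""
    total = Counter(words)
    dup = Counter(a for a, b in zip(words, words[1:]) if a == b)
    return {w: dup[w] / n for w, n in total.items() if n >= min_count}

words = ["x", "x", "x", "y", "x", "y", "x", "z", "z"]
table = pdup_table(words)
best = max(table, key=table.get)   # word with the highest pdup
print(best, table[best])
```

In this toy input "x" forms the most doublets (two of them) but "z"
has the highest pdup, which is the distinction made above between
"topwd" and the word with highest pdup(X).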

    > what is table-guessed Pinyin?
    
The only usable Chinese texts that I could find were in Chinese
characters (ideograms), not in Pinyin. There are many different
ideograms with the same pronunciation (think of English "too", "2",
"to"). Conversely, there are many ideograms which have more than one
Mandarin pronunciation (think of "read" and "record" and the letter
"z"). The situation is not as bad as in Japanese, where multiple
sounds are the rule, but definitely worse than English.

The data I have about pronunciation is basically the following table,
created by merging a handful of similar tables found on the net:

# GB OUTPUT           # HEX  UNIC Frq.Ptt Frq.Red Frq.WWW Definition
# -- --------         - ---- ---- ------- ------- ------- ---------------------
  ， «_,_»            # A3AC FF0C   14442   58032       0 FULLWIDTH_COMMA
  。 «.»              # A1A3 3002    6715   25450       0 IDEOGRAPHIC_FULL_STOP
  了 liao3            # C1CB 4E86    1661   20799  305165 {liao3,le5,liao4}
  的 de5              # B5C4 7684    9060   15124  581720 {de5,di2,di4}
  不 bu4              # B2BB 4E0D    1614   14590  311442 {bu4,bu2}
  ： «_:_»            # A3BA FF1A       0   12782       0 FULLWIDTH_COLON
  一 yi1              # D2BB 4E00    1722   11745  363590 {yi1}
  来 lai2             # C0B4 6765     983   11192  154591 {lai2}
  道 dao4             # B5C0 9053     176   10979  143820 {dao4}
  人 ren2             # C8CB 4EBA    2399   10276  212609 {ren2}

The first column is the ideogram in the Mainland (Guo Biao, GB)
encoding, consisting of two bytes in the 128-255 range (shown in the
HEX column). The UNIC column is the corresponding Unicode value. The
"Frq" columns are the counts of that ideogram in the Union Pentateuch,
in the Dream of a Red Mansion, and in a modern usage table found on
the net. The last column tells whether it is a "special character"
(punctuation, graphic symbol, etc.) and, if not, which sounds are
given for it by one or more of the input tables.
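
The relation between the HEX and UNIC columns can be checked with
Python's built-in "gb2312" codec. This is only a sketch; the helper
name gb_to_unicode is made up for this example.

```python
def gb_to_unicode(gb_hex):
    """Decode a two-byte GB code given as hex, e.g. "C1CB".

    Returns the character and its Unicode value as four hex digits.
    """
    ch = bytes.fromhex(gb_hex).decode("gb2312")
    return ch, f"{ord(ch):04X}"

for gb_hex in ["A3AC", "C1CB", "C0B4"]:
    print(gb_hex, *gb_to_unicode(gb_hex))
```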

Presumably the correct sound depends on context and/or epoch and/or
subdialect. Some tones, in particular, are pronounced differently
depending on the tones of adjacent syllables; and some of that
variation may have found its way into the tables. (Tone "5" appears to be
vestigial/controversial, or may indicate "no definite tone" --- or
possibly both, depending on the input table.)

There are commercial tools out there which claim to convert ideograms
to usable pinyin, taking context into account; but I don't have access
to them. The second column is my best guess, what I called
"table-guessed pinyin": out of the Mandarin sounds listed in the last
column, I picked one by taking a weighted consensus of the various
input tables, breaking ties arbitrarily. 
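
To make "weighted consensus" concrete, here is a small Python sketch
of one way to do it. The table names, the weights, and the function
guess_reading are all hypothetical, and the weights I actually used
are not reproduced here. Each input table votes for every reading it
lists, with that table's weight, and ties are broken arbitrarily (here
by taking the alphabetically first reading, so that the result is
repeatable).

```python
from collections import defaultdict

def guess_reading(readings_by_table, weights):
    """Pick one reading by weighted vote over the input tables.

    readings_by_table: {table_name: [reading, ...]}
    weights:           {table_name: weight}; missing tables weigh 1.0
    """
    score = defaultdict(float)
    for table, readings in readings_by_table.items():
        for reading in readings:
            score[reading] += weights.get(table, 1.0)
    if not score:
        return None
    best = max(score.values())
    # Arbitrary but deterministic tie-break.
    return min(r for r, s in score.items() if s == best)

tables = {"A": ["liao3", "le5"], "B": ["liao3"], "C": ["liao4"]}
print(guess_reading(tables, {"A": 1.0, "B": 2.0, "C": 1.0}))
```

A vote of this kind knows nothing about context, so it cannot tell
which of the listed sounds applies at a given place in the text.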

It seems that my choice was wrong more often than not: for instance, 
I suspect that for "了" = {liao3,le5,liao4}, the most common word in the
Red Mansion novel, the best guess should have been "le5" (a very
popular but hard to translate particle), not "liao3".

All the best,

--stolfi