Re: Diringer's "imprecision" and copy(-daiin) was: intercultural artefact
26/01/02 11:25:05, "Gabriel Landini" <G.Landini@xxxxxxxxxx> wrote:
>But wouldn't the statistics be somehow influenced by the word frequency distribution as
>Aren't common words like "the, and, for, to" the ones that are the main culprits of not
>being able to
Words like "student" are not likely to be repeated either,
unlike in my Indonesian example (mahasiswa). You will find them
repeated in English (and in most of our languages) only at the
juncture of two clauses or sentences. I think I know how this can be
accounted for, for I have thought a little about it.
You can pull the same trick with "to", but you have
to resort to a different trick with "and" and "the" because,
in English, they are normally followed by something. It is
only on very rare occasions that you will read such things as...
as... oh, I have it on the... the... er... the tip of my tongue...
I have not looked at the VMS for a long, long, long
time (eh eh eh), but I distinctly remember that it was characterized
by frequent word reduplication, and I did remark upon it.
Consider now the quasi pathological case of Hebrew, where
you will observe many occurrences of the article "ha"
exactly two positions apart because Hebrew says "the question
the vexing" where English says "the vexing question".
If we analyze the Hebrew article not as a separate word,
but as a prefix, this peculiar statistical property of Hebrew
disappears. Consider now Bantu languages, e.g. Swahili:
kiboko kikubwa kimoja = one (moja) big (kubwa) hippopotamus
(boko). Nothing out of the ordinary if we spell it like that
(which is the standard spelling). But if we separate the
prefix ki-, we get: ki boko ki kubwa ki moja, and you know
by now what this does to the "copy(-2)" statistics. I wonder
if there is something there. In fact, I do not wonder, I am
pretty sure that there is something there.
If you are speaking of more than one hippopotamus, BTW, ki
becomes vi: viboko vikubwa viwili (wili = two).
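To make the Swahili point concrete, here is a minimal sketch in Python. It assumes that "copy(-n)" means the fraction of words that are identical to the word n positions earlier; that definition, and the name copy_rate, are my own assumptions for illustration, not an established formula.

```python
def copy_rate(tokens, n):
    """Fraction of positions i >= n where tokens[i] equals tokens[i - n].

    Assumed reading of the "copy(-n)" statistic: how often a word
    repeats the word found n positions before it.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    pairs = len(tokens) - n
    if pairs <= 0:
        return 0.0  # text too short to compare anything at this distance
    hits = sum(1 for i in range(n, len(tokens)) if tokens[i] == tokens[i - n])
    return hits / pairs


# Standard spelling versus the same phrase with the prefix ki- split off.
standard = "kiboko kikubwa kimoja".split()
split_prefix = "ki boko ki kubwa ki moja".split()

print(copy_rate(standard, 2))      # 0.0
print(copy_rate(split_prefix, 2))  # 0.5
print(copy_rate(split_prefix, 1))  # 0.0
```

Nothing about the language changes between the two runs, only the word division, yet the copy(-2) figure jumps from 0 to one half. The plural "vi boko vi kubwa vi wili" gives the same result.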
This, I think, is a much more relevant, much more potent,
statistic than the entropy. The entropy, after all, is
the theoretical minimal cost of transmitting a message
assuming the knowledge of the frequency of the n-tuplets
in the language of that message. This cost is (to me)
evidently dependent upon how the message was encoded.
Thus, if you have a language where long vowels are
expressed by reduplication e.g. ariiki, its entropy
will differ from that of the very same language with its
long vowels expressed by, say, capitalizing: arIki.
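The dependence on encoding can be checked with a small sketch. It estimates the Shannon entropy from the observed frequencies of the n-tuplets in a string; the function name ngram_entropy is hypothetical, and a single word is of course only a toy sample, where a real estimate would need a whole corpus.

```python
from collections import Counter
from math import log2


def ngram_entropy(text, n=1):
    """Shannon entropy (bits per n-tuplet) of the n-tuplets observed in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0  # text shorter than n: nothing to count
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Same word, two ways of writing the long vowel.
doubled = ngram_entropy("ariiki")      # long vowel written by reduplication
capitalized = ngram_entropy("arIki")   # long vowel written as a capital

print(doubled, capitalized)
```

With single characters (n=1) the two spellings give different figures: the reduplicated form concentrates the counts on "i", while the capitalized form spreads them evenly over five distinct symbols.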
I have just a gut feeling that this "copy(-n)" statistic,
for lack of a better term, is much, much more informative.