
The Hubbard statistics about copy(-1), copy(-2)... explained



Since no one has taken up my challenge of figuring out what
was behind those frequencies of copy(-1), copy(-2) and so
on... this poor overworked frog has to do it, when it would
rather bask in a nice drizzle on the lily pond. Oh well...
here we go.

Take a standard deck of cards, pick a card. Look at it. A
king. Pick another card. Do not look at it yet. What is the
chance of it being another king? Easy: 3 kings left in the
deck, out of 53 cards (52 plus the two jokers, less the king
you drew), which makes it 3/53.

Question now. Does knowing where you picked that card (say,
the eighth card after the king) help you guess what it is?
No, if the deck has been shuffled. Yes, if it has been
arranged in some particular order.

Now take a standard plaintext story, and count the words in
it. Now pick a word. Look at it. It's "pick". Pick another
word. Do not look at it yet. What is the chance of it being
another king -- er... "pick"? Easy: the text has N words,
and "pick" occurs n times. So there are n-1 picks left in
N-1 words, which makes it (n-1)/(N-1).
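
Both calculations can be checked mechanically. Here is a
minimal sketch in Python. The helper name chance_of_repeat
and the sample sentence are made up for illustration, and
the deck is assumed to hold 54 cards (52 plus two jokers).

```python
from fractions import Fraction

def chance_of_repeat(n, N):
    """Chance that a second item drawn at random matches the first,
    when the first is one of n identical items among N in total."""
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N and N >= 2")
    return Fraction(n - 1, N - 1)

# Cards: 4 kings among 54 cards (52 plus two jokers, an assumption).
king_again = chance_of_repeat(4, 54)

# Words: a made-up sample text, split on whitespace.
words = "pick a card then pick a word then pick again".split()
pick_again = chance_of_repeat(words.count("pick"), len(words))

print(king_again, pick_again)
```

Exact fractions are used rather than floats so the result
can be compared directly with the hand calculation.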

Does knowing where you picked that word (say, eighth word
after "pick") help you guess what it is, whether it is
"pick" again? Yes, if the text is arranged in some
particular order. Otherwise, no.

What sort of "particular order" increases the chance of the
same word occurring exactly p positions apart? The order
found in lists, for instance:

war chariots, both wheels sound: 5 items
war chariots, left wheel broken: 8 items
war chariots, right wheel broken: 1 item
female slaves, no teeth missing: 12 items

..... and so on. (Example inspired by an old post of Robert
Firth's).
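
To make the effect of list order concrete, the sketch below
compares how often a word recurs exactly p positions later
with the rate expected if the same words were shuffled. It
is only an illustration: the function names are mine, the
tokenizing (lower-casing, dropping commas and colons) is an
assumption, and the four-line inventory is far too small for
real statistics.

```python
from collections import Counter

def observed_repeat_rate(words, p):
    """Fraction of positions i where words[i] == words[i + p]."""
    pairs = len(words) - p
    if p < 1 or pairs < 1:
        raise ValueError("p must be between 1 and len(words) - 1")
    hits = sum(1 for i in range(pairs) if words[i] == words[i + p])
    return hits / pairs

def shuffled_repeat_rate(words):
    """Expected rate for a random shuffle of the same words:
    the sum over distinct words of n(n-1) / (N(N-1))."""
    N = len(words)
    if N < 2:
        raise ValueError("need at least two words")
    counts = Counter(words)
    return sum(n * (n - 1) for n in counts.values()) / (N * (N - 1))

# The inventory above, with punctuation stripped (an assumption).
text = """
war chariots, both wheels sound: 5 items
war chariots, left wheel broken: 8 items
war chariots, right wheel broken: 1 item
female slaves, no teeth missing: 12 items
"""
tokens = text.lower().replace(",", " ").replace(":", " ").split()

# Each entry happens to have 7 tokens, so p = 7 lines up the columns.
observed = observed_repeat_rate(tokens, 7)
expected = shuffled_repeat_rate(tokens)
print(observed, expected)
```

With this tokenizing every entry is seven tokens long, so
the repeats pile up at p = 7, well above the shuffled rate.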

What sort of "particular order" _decreases_ the chance of
the same word occurring exactly p positions apart?

This is a harder question. We know that the particular order
imposed upon English by its grammar _decreases_ the chance
of the same word occurring twice in a row. But Malay, where
a word may be repeated to mark the plural, is not
constrained in this way.

Yet, when you think about English, there seems to be no
particular reason for a decrease or an increase of the
chance of the same word occurring p positions apart when
p=2, 3, 4,... infinity.

Think about Arabic or Hebrew now. Whereas we say in English
"the big apple", Arabic and Hebrew say "the apple the big".
This increases the chance of "the" (rather, al in Arabic, ha
in Hebrew) occurring 2 positions apart. But this is only
true of "the" (al, ha), not of the other words of those
languages.

Now take the language of Easter Island, and consider these:

on the boat: i runga i te miro
in the boat: i roto i te miro
onto the boat: ki runga ki te miro
into the boat: ki roto ki te miro

(I'll spare you the rest of the collection)

Here, we observe that a small set of words (i, ki...) occurs
more often than expected two positions apart.
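
A per-word tally shows which words are responsible. The
sketch below counts, for each word, how often it recurs
exactly p positions later inside a single phrase. The
function name is mine, and the four phrases quoted above are
a sample, not a corpus.

```python
from collections import Counter

def repeats_by_word(phrases, p):
    """Count, per word, how often it recurs exactly p positions
    later within the same phrase (never across phrases)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    hits = Counter()
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(words) - p):
            if words[i] == words[i + p]:
                hits[words[i]] += 1
    return hits

phrases = [
    "i runga i te miro",
    "i roto i te miro",
    "ki runga ki te miro",
    "ki roto ki te miro",
]

lag2 = repeats_by_word(phrases, 2)
lag1 = repeats_by_word(phrases, 1)
print(lag2, lag1)
```

On this sample only i and ki turn up at p = 2, and nothing
repeats at p = 1.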

These statistics reflect part of "what makes this language
tick," an expression I vastly prefer to "the grammar of this
language" because I no longer believe in syntax or in
grammar, except as a figment of linguists' imaginations.

What I wrote above is, to my knowledge, original. I have
never seen the like of it in linguistic theories. I also
believe that, if followed up, it would be quite fruitful for
the analysis of languages and that it would lead to a better
model of what makes languages tick. I owe it to the study
of this vexing medieval manuscript (does not VMS also stand
for "Vexing Manuscript"?). So here I am again, with my old
saw: deciphering the VMS is minor, what is important is what
we discover on the way.