
Re: John Chadwick (Linear B) of corpus size. Comments invited.



Jacques Guy writes:
> Yes, my point was that Chadwick's formula is dead
> wrong. However, I would like other opinions. 

OK, here is my $0.02. The formula may not be as wrong as everybody here
seems to suppose, at least if we approach it with a little goodwill. First,
it should be taken to apply only to "NYN" type languages. Robinson must be
made aware of the incredible importance of getting the language right, but 
once we have that much, it's not that hard to justify the formula. 

The 0th step of the analysis is to arrange the symbols in frequency order.
The 1st step is to construct a grid of what can follow what. As we try to 
grok the pattern, it is this grid that gets rearranged over and over again:
something that is extremely hard to do with higher order statistics because 
we don't really have the visual means to deal with higher dimensional grids. 
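As a minimal sketch of steps 0 and 1, assuming the text is simply a string
with one character per symbol (the function names and the sample string are
illustrative, not part of anyone's actual procedure):

```python
from collections import Counter

def frequency_order(text):
    """Step 0: symbols sorted from most to least frequent (ties alphabetical)."""
    counts = Counter(text)
    return sorted(counts, key=lambda s: (-counts[s], s))

def bigram_grid(text):
    """Step 1: grid[i][j] counts how often symbols[i] is followed by symbols[j]."""
    symbols = frequency_order(text)
    index = {s: i for i, s in enumerate(symbols)}
    grid = [[0] * len(symbols) for _ in symbols]
    for a, b in zip(text, text[1:]):
        grid[index[a]][index[b]] += 1
    return symbols, grid

# Hypothetical sample text, used only to show the shape of the output.
symbols, grid = bigram_grid("abracadabra")
print("  " + " ".join(symbols))
for s, row in zip(symbols, grid):
    print(s, " ".join(str(c) for c in row))
```

Rearranging the grid by hand then amounts to permuting `symbols` and
reordering rows and columns to match.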

Given our human limitations in constructing, displaying, and comprehending 
higher order data, it is likely that 1st order statistics will continue to
play a very significant role in solving the puzzle, even if computers can 
store (and selectively display) the higher order material with ease. 

Now, to fill in a bigram grid with any chance of random fluctuations not
totally overwhelming the true pattern we need n^2 data points (actually, some
constant times n^2 is better), so there is a fair bit of engineering wisdom 
in the formula.
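The arithmetic behind this can be sketched as follows. A grid over n symbols
has n^2 cells, so a corpus of fewer than n^2 bigrams leaves less than one
observation per cell on average. The constant `c` and the inventory sizes 25
and 90 are assumptions chosen for illustration only:

```python
def min_corpus_size(n_symbols, c=1):
    """Rough corpus size needed to populate an n x n bigram grid.

    c is an assumed safety factor; the argument above only says that
    some constant times n^2 is better than n^2 itself.
    """
    return c * n_symbols ** 2

def mean_count_per_cell(corpus_len, n_symbols):
    """Average number of bigram observations per grid cell."""
    return (corpus_len - 1) / n_symbols ** 2

# Illustrative inventory sizes: a small alphabet and a syllabary-sized script.
for n in (25, 90):
    print(n, min_corpus_size(n), min_corpus_size(n, c=5))
```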

This of course applies only to ordinary phoneme-, mora-, or syllable-based
scripts, where the usual goal (if systems created by a long evolutionary
process can be said to have a goal) is to map sounds to symbols in a simple
fashion. For the VMS, we don't even know whether there is a spoken system
behind it (though I personally strongly suspect there is) and the goal of the
script seems to be to deliberately obscure, rather than plainly present, the
relationship between sounds and symbols (so a corpus larger than n^2 should be
required). 

I think this goes a long way towards explaining why the kind of binary
encoding suggested as a counterexample renders the formula meaningless: once 
such an encoding is performed the relationship between the symbols and the 
sounds is anything but straightforward. 
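To make the point concrete, here is a small sketch, assuming the
counterexample was something like a fixed-width binary re-encoding of each
symbol (the exact scheme is a guess, and `binary_encode` is a hypothetical
helper):

```python
def binary_encode(text):
    """Replace each symbol by a fixed-width binary code word."""
    symbols = sorted(set(text))
    # Number of bits needed to distinguish all symbols (at least 1).
    width = max(1, (len(symbols) - 1).bit_length())
    code = {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}
    return "".join(code[s] for s in text)

# Hypothetical sample text.
encoded = binary_encode("abracadabra")
n_after = len(set(encoded))
print(encoded, n_after, n_after ** 2)
```

The re-encoded text uses only two symbols, so n^2 drops to 4, yet nothing
about the underlying text has become easier: each sound is now spread over
several symbols, and a 2 x 2 bigram grid says almost nothing about it.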

Andras Kornai