Re: John Chadwick (Linear B) of corpus size. Comments invited.

On 11 May 00, at 22:59, Jacques Guy wrote:
> I would hate to see
> nonsense like "Chadwick's formula" fed to a wide
> readership. IF it is nonsense. I think it is, but I
> prefer not to trust my judgment. Comments, everybody?

I think your binary-representation example is an excellent way to
show that what counts as "a character" in an unknown script is far
from intuitive. As a consequence, Chadwick's formula may not apply,
because the would-be decipherer cannot reliably recover the character
set. This is similar to treating each stroke of a Roman letter as a
character, or to Stolfi's superanalytical alphabet.

Where does Chadwick's formula come from? I have no idea.
If we imagine that we want to be sure of having seen every character
of the alphabet at least once, and that the characters are equally
likely (a flat distribution) and occur at random, then this may be
somewhat related to the "collector's dilemma" problem (I can't
remember the formulation right now, but I am sure it includes Euler's
number).
(How many items do you have to collect before your collection is
complete?)
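
For reference, this is usually called the coupon collector's problem.
A minimal sketch in Python, assuming n equally likely characters drawn
independently (the function name is only illustrative): the expected
number of draws needed to see every character at least once is
n * H_n, where H_n = 1 + 1/2 + ... + 1/n. For large n this is roughly
n*ln(n) + 0.5772*n, so the constant involved is the Euler-Mascheroni
constant together with the natural logarithm.

```python
from fractions import Fraction


def expected_draws_uniform(n: int) -> float:
    """Expected number of draws to see all n equally likely characters.

    Assumes a flat distribution and independent draws, which real
    texts do not satisfy; this is only the idealised case.
    """
    # H_n = 1 + 1/2 + ... + 1/n, summed exactly to avoid rounding drift
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return float(n * harmonic)
```

Under these assumptions a 26-character alphabet needs about 100 draws
on average, e.g. expected_draws_uniform(26).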

Of course this is not the case in real languages, since the character
distribution is not flat, etc., but I wonder whether the corpus size
needed to make sure that every character has appeared at least once
could be calculated by accounting for the shape of the character
distribution.
But! This shape would be known only *after* we know what a character
is and how large the alphabet is. That in turn depends on what we
choose to call a character, so the required corpus size may differ
with each choice.
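
To illustrate the non-flat case, here is a rough Monte Carlo sketch in
Python. It assumes exactly what we do not have for an unknown script:
a known character set with known relative frequencies, drawn
independently. The function name, the trial count and the seed are
arbitrary choices for illustration.

```python
import itertools
import random


def simulate_draws_to_complete(weights, trials=2000, seed=0):
    """Estimate the mean number of draws needed to see every character.

    weights: relative frequencies, one per character; all must be > 0,
    otherwise a character never appears and the loop never ends.
    """
    rng = random.Random(seed)
    symbols = range(len(weights))
    cum_weights = list(itertools.accumulate(weights))
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        # Keep drawing until every character has been seen once
        while len(seen) < len(weights):
            seen.add(rng.choices(symbols, cum_weights=cum_weights)[0])
            draws += 1
        total += draws
    return total / trials
```

Comparing simulate_draws_to_complete([1] * 10) with a skewed set of
weights such as [1 / r for r in range(1, 11)] shows how much the rare
characters push up the corpus size needed, compared with the flat
case.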

Does this mean that we will never be certain of what a character is
in an unknown script?

I am more confused now... :-/