Re: John Chadwick (Linear B) of corpus size. Comments invited.
On 11 May 00, at 22:59, Jacques Guy wrote:
> I would hate to see
> nonsense like "Chadwick's formula" fed to a wide
> readership. IF it is nonsense. I think it is, but I
> prefer not to trust my judgment. Comments, everybody?
I think that your example of a binary representation is an excellent
way to show that "a character" in an unknown script is a very
non-intuitive notion. As a consequence, Chadwick's formula may not
apply, because the would-be decipherer may be unable to recover the
character set. Further, this is similar to treating each stroke of
the Roman alphabet as a character, or to Stolfi's superanalytical
alphabet.
Where does Chadwick's formula come from? I have no idea.
If we imagine that we want to be sure of having read every character
of the alphabet at least once, and that their distribution is flat
(each character is equally likely to appear, independently of the
others), then this may be somewhat related to the "coupon collector's
problem" (I can't remember the formulation right now, but I am sure it
includes Euler's constant): how many items do you have to collect
before your collection is complete?
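For reference, a minimal sketch of that calculation, assuming a flat
distribution over n characters and independent draws. Under those
assumptions the expected number of draws needed to see every character
at least once is n * H_n, where H_n is the n-th harmonic number; this
is approximately n ln n + gamma * n + 1/2, with gamma the
Euler-Mascheroni constant (about 0.5772). The function names and the
26-character example are illustrative only and have nothing to do with
Chadwick's formula itself.

```python
from fractions import Fraction
import math

# Euler-Mascheroni constant (not Euler's number e).
EULER_MASCHERONI = 0.5772156649015329


def expected_draws_uniform(n: int) -> float:
    """Expected draws to see all n equally likely symbols at least once (n * H_n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Exact harmonic number H_n = 1 + 1/2 + ... + 1/n, summed as fractions.
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return float(n * harmonic)


def approx_draws_uniform(n: int) -> float:
    """Asymptotic approximation: n ln n + gamma * n + 1/2."""
    return n * math.log(n) + EULER_MASCHERONI * n + 0.5


if __name__ == "__main__":
    n = 26  # hypothetical alphabet size, chosen only as an example
    print(expected_draws_uniform(n), approx_draws_uniform(n))
```

Note that this is an average: any particular text of that length may
still be missing a character.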
Of course, this is not the case in languages, since the character
distribution is not flat, etc. But I wonder whether the size of the
corpus needed to make sure that every character has appeared at least
once could be calculated by accounting for the shape of the character
distribution.
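As a rough sketch of how the shape of the distribution could be taken
into account, assuming the character probabilities are known and each
character is drawn independently (real text is not like that), the
expected number of draws until every character has appeared can be
written by inclusion-exclusion over non-empty subsets S of characters:
E[T] = sum over S of (-1)^(|S|+1) / (sum of p_i for i in S). The
function name and the two four-character distributions below are
hypothetical, and the cost grows as 2^n, so this form is only usable
for small alphabets.

```python
from itertools import combinations


def expected_draws(probs):
    """Expected independent draws until every symbol has appeared at least once.

    Inclusion-exclusion over all non-empty subsets of symbols; O(2^n) terms.
    """
    if not probs or any(p <= 0 for p in probs):
        raise ValueError("probabilities must be positive")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    total = 0.0
    for size in range(1, len(probs) + 1):
        # Odd-sized subsets are added, even-sized subsets are subtracted.
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(probs, size):
            total += sign / sum(subset)
    return total


if __name__ == "__main__":
    flat = expected_draws([0.25, 0.25, 0.25, 0.25])   # hypothetical flat script
    skewed = expected_draws([0.7, 0.2, 0.05, 0.05])   # hypothetical skewed script
    print(flat, skewed)
```

With the same number of characters, the skewed distribution needs more
draws on average than the flat one, because the rare characters take
longer to turn up.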
But! This would be known only *after* we know what a character is
and the size of the alphabet. In turn, this depends on what we call
a character, and therefore the required size of the corpus may differ.
Would this mean that we will never be certain of what a character is
in an unknown script?
I am more confused now... :-/
Gabriel