
Re: John Chadwick (Linear B) on corpus size. Comments invited.



    > [Rene:] If the binary encoding is done without loss of
    > information (which would be fair), one needs two more symbols: a
    > character space and a word space.
    
Um, not really. The word separator would be coded in binary just 
like any letter. As for character separators, they are
not needed if one uses a fixed-length code (e.g. ASCII), or
any of a number of self-delimiting codes.
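
To make that concrete, here is a quick Python sketch (my own
illustration, not anything from the encoding literature): with a
fixed-length code every symbol, including the word space, gets the
same number of bits, so the decoder can split the bit stream without
any character separator.

    # Fixed-length binary coding: the word space is just another symbol,
    # and no character separator is needed because every code word has
    # the same width.
    def encode_fixed(text, alphabet):
        width = max(1, (len(alphabet) - 1).bit_length())   # bits per symbol
        code = {ch: format(i, "0%db" % width) for i, ch in enumerate(alphabet)}
        return "".join(code[ch] for ch in text)

    def decode_fixed(bits, alphabet):
        width = max(1, (len(alphabet) - 1).bit_length())
        return "".join(alphabet[int(bits[i:i + width], 2)]
                       for i in range(0, len(bits), width))

    alphabet = " abcd"        # ' ' is the word space, coded like any letter
    bits = encode_fixed("ab cad", alphabet)
    assert decode_fixed(bits, alphabet) == "ab cad"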

    > Then, any child will see that instead of 4 symbols, there really
    > are N+2 macro-symbols, and one is back to square 1.
    
Well, but how would one know that certain groups of characters are
really macro-characters, before deciphering the script?

    > [Andras:] The formula may not be as wrong as everybody here
    > seems to suppose... First, it should be taken to apply only to
    > "NYN" type languages.
    > 
    > The 0th step of the analysis is to arrange the symbols in
    > frequency order. The 1st step is to construct a grid of what can
    > follow what. ... we need n^2 data points (actually, some
    > constant times n^2 is better) ...
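
Just so we are all looking at the same thing, here is a quick Python
sketch of that "grid of what can follow what" (my own illustration;
the sample words are made up). The grid has n^2 cells, which is where
the need for on the order of n^2 data points comes from.

    # Digraph grid: for each ordered pair (a, b), count how often
    # symbol b immediately follows symbol a within a word.
    from collections import Counter

    def digraph_grid(words):
        grid = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                grid[(a, b)] += 1
        return grid

    sample = "daiin daiin qokeedy qokain shedy".split()   # made-up sample
    grid = digraph_grid(sample)
    symbols = sorted({c for w in sample for c in w})
    print("   " + " ".join("%2s" % b for b in symbols))
    for a in symbols:
        print("%2s " % a + " ".join("%2d" % grid[(a, b)] for b in symbols))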
    
Andras's analysis assumes a cryptographic-style attack based on digraph
frequencies. But my impression is that the decipherment of known natural
languages with unknown scripts (my "NYN" case) hardly ever happens that way.
For one thing, the frequencies of letters --- not to mention digrams
--- are strongly affected by subject matter and spelling anomalies,
which are the rule in ancient texts. 

Unfortunately there don't seem to be many historical examples of pure
NYN decipherment to go by. In most cases (Egyptian, Cuneiform, Maya,
Hittite), decipherment relied heavily on "cribs", i.e. on some
information about the meaning of the text. 

Perhaps Linear B is one legitimate example of crib-less NYN
decipherment? But, in that specific case, I believe that the solution
was found by successfully guessing the structure of some sentences,
and identifying some characteristic morphological elements ---
particles, inflections, connective verbs, whatever. From that Ventris
got tentative values for some letters, which then made it possible to
identify other non-function words. Is this account correct?

For this approach, one needs a corpus that is just long enough to
display recognizable morphological regularities. I don't see how that
corpus-size threshold could be related to the alphabet's size (except
that the approach would not work well with a logographic script).
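
To illustrate the kind of regularity I have in mind (a toy Python
sketch of my own, not Ventris's actual procedure), one can already
pull candidate inflectional endings out of a raw transcription just by
counting frequent word-final letter sequences:

    # Candidate endings: word-final sequences of 1-2 symbols that recur
    # often enough to look like morphology rather than coincidence.
    from collections import Counter

    def suffix_candidates(words, max_len=2, min_count=2):
        counts = Counter()
        for w in words:
            for k in range(1, min(max_len, len(w)) + 1):
                counts[w[-k:]] += 1
        return [(s, c) for s, c in counts.most_common() if c >= min_count]

    corpus = "wanakos wanakoi doulos douloi korwos korwoi".split()  # made-up toy corpus
    print(suffix_candidates(corpus))
    # The corpus is "long enough" for this approach roughly when such
    # endings recur often enough to stand out from chance.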

All the best,

--stolfi