Re: WG: average word length in VMS
Jacques Guy wrote:
> Er... phonosyntactic oddity? You mean the way in which
> the letters or groups of letters presumably representing
> sounds combine together?
Yes, that's pretty much what I mean. If entropy measures how
predictable the next letter is, then those numbers should give
each token (or word, or even trigraph) a value for its weirdness
relative to the general stats of the language. Basically, given
the analysis of an enciphered textbook (in English) on early
civilizations of the Americas, couldn't we tell that the word
Quetzalcoatl didn't
belong? If we could peg foreign words in the text, we could
make reasonable guesses about what language they were in
(English texts rarely include Serbian, but French and Latin are
not uncommon). Beyond that, on a subtler level, spellings like
'ough' in English don't follow the standard rules, but would
leave a fingerprint by forming a reasonably large 'standard
group of English spelling anomalies'. I think we
might be able to get a fingerprint saying something to the
effect of 83% of words follow the rules to a reasonable degree,
12% of words deviate by value A, 4% deviate by value B, and 0.3%
deviate by value C, with the remaining 0.7% deviating by a value
greater than C. In a truly phonetic representation of an
un-influenced language, there should be no words that deviate.
Addition of foreign words, foreign sounds (English 'th' from the
Anglo-Saxons) and foreign spellings would all be factors that
will cause deviations. This should give a fingerprint to a
language based on the alphabet its speakers created or borrowed,
the language's historical influences, and the nation's physical
proximity to and trade policy with neighboring countries. With this
sort of model, one might even be able to correctly decide if an
encrypted text was Berlin German or Belgian German (given those
as the only two choices). Thoughts?
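
To make the idea above concrete, here is a rough sketch of how a per-word
"weirdness" value and the resulting fingerprint might be computed. It is
only one possible reading of the proposal: it assumes a character-bigram
model with add-one smoothing as a stand-in for "the general stats of the
language", and the function names (train_bigrams, weirdness, fingerprint)
and the band edges standing in for values A, B and C are all made up for
illustration.

```python
import math
from collections import Counter

def train_bigrams(words):
    """Count character bigrams over words padded with ^ (start) and $ (end)."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = set()
    for w in words:
        padded = "^" + w.lower() + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, len(alphabet)

def weirdness(word, model):
    """Mean surprisal in bits per letter transition; higher means more unusual."""
    pair_counts, context_counts, vocab = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Add-one smoothing (an assumption) so unseen pairs get a small,
        # nonzero probability instead of an infinite surprisal.
        p = (pair_counts[(a, b)] + 1) / (context_counts[a] + vocab)
        total -= math.log2(p)
    return total / (len(padded) - 1)

def fingerprint(words, model, edges):
    """Fraction of words in each weirdness band; edges are ascending cut-offs."""
    bins = [0] * (len(edges) + 1)
    for w in words:
        score = weirdness(w, model)
        bins[sum(score > e for e in edges)] += 1
    return [b / len(words) for b in bins]
```

Trained on a decent-sized word list for one language, weirdness() should
score a borrowed name well above everyday words, and fingerprint() gives
the "x% follow the rules, y% deviate by A, ..." breakdown. Whether such
breakdowns differ enough between two close dialects to tell them apart
is exactly the open question.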
BTW, here is a good page on Chinese dialects
Regards,
Brian