Re: WG: average word length in VMS
I see a lot of encouraging ideas in all the latest
mail!! (Or at least ideas that parallel my own
thinking. ;-) )
Brian Eric Farnell wrote:
>
> I really like the syllabic idea, too. Before I found the
> discussion of the Chinese theory, I also got that impression,
> much to my embarrassment after I posted stuff that was already
> discussed. My problem with the syllabic idea is that we should
> see more 1,2 and 3 letter tokens than expected, not less.
Jacques Guy wrote:
>
> Jorge Stolfi wrote:
>
>
> > I seem to recall that
> > Sukhotin's algorithm applied to Voynichese produced only a few
> > unconvincing results that led nowhere (probably echoes of the OKOKOKO
> > model). Part of the problem may have been the multiletter
> > Voynichese->EVA encoding, which tends to obscure the C-V alternations.
>
> Not part of the problem. The whole problem, I'd say, or very close.
I feel confident that in Voynichese clusters of
letters represent individual phonemes, and that
probably explains the high h1-h2 of Voynichese.
Jacques Guy wrote:
> Jorge Stolfi wrote:
> > On the other hand, the KMC structure is not unlike the structure of
> > single *syllables* in Latin and other natural languages. Syllable
> > boundaries are partly a matter of convention; but, off the top of my
> > head, I would guess that the Latin syllable can be said to have the
> > general structure SCRVVN where all letters are optional except for one
> > V; and S, R, and N are specific subsets of the consonants:
> >
> > in prin ci pio cre a vit de us cae lum et te rram
> > te rra au tem e rat i na nis et va cu a et te ne brae
> > su per fa ci em a by ssi ...
> >
> Look, I know Latin, I know Chinese. The pattern you have uncovered looks
> strikingly like Chinese. Latin? Let me scratch my head. Scratch... scratch...
> scratch... scratch... er.... scratch... scratch... please don't wait for me.
I've thought for a long time that Voynichese "words"
are syllables. This indicates further that Western
languages other than French could be good candidates.
How about Venetian, Jorge? :-) (Why do we think that
the VMs probably came from Venice? I think that was
Toresella's idea.)
I also used to think that the strict internal
structure of Voynichese words contributed to the high
h1-h2 of Voynichese. However, the present discussion
shows that you see a constrained, paradigmatic
structure both in "monosyllabic" Eastern languages and
in "polysyllabic" Western languages.
Is this really surprising? Consider the fact that the
"monosyllabic" Eastern languages form compounds of more
than one syllable that would be considered "words" in
Western languages. The real difference between
"monosyllabic" Eastern languages and Western
"polysyllabic" languages is that in the Eastern
languages all forms are free, while in the Western
languages many forms are bound. In English such
prefixes as un-, super-, ab-, and inter-, and
suffixes like -tion, -ly, -ment, and -ary cannot occur
as free-standing words. In Mandarin any syllable may
occur as a free-standing word, but it can also form
compounds with other syllables that a Westerner would
consider a "word", even if the Chinese themselves do
not think of it this way. They have syllables that
can perform the functions of un-, super-, -ly, -tion,
etc. We've discussed this on this list.
So perhaps finding that Western languages exhibit
syllabic structure comparable to Voynichese and to
Eastern languages as well should come as no surprise.
Brian Eric Farnell wrote:
>
> Of course that might explain all of the 'ain' and 'aiin's if they
> were nulls thrown in to obfuscate short syllables. As far as
> Chinese proper, there are a lot more Voynich tokens than modern
> Mandarin, I don't have the stats for other dialects, but I think
> most of them are still too small.
How many tokens are there in Voynichese? I forget.
Brian Eric Farnell wrote:
>
> Jacques Guy wrote:
> > Er... phonosyntactic oddity? You mean the way in which
> > the letters or groups of letters presumably representing
> > sounds combine together?
> Yes, pretty much that's what I mean. If entropy measures the
> percentage chance to correctly guess the next letter, then those
> numbers should give each token (or word, or even down to
> trigraphs) a value of its weirdness relative to the general
> stats of the language. Basically, given the analyses of an
> enciphered text book (in English) on early civilizations of the
> Americas, couldn't we tell that the word Quetzalcoatl didn't
> belong? If we could peg foreign words in the text, we could
> make reasonable guesses about what language they were in
> (English texts rarely include Serbian, but French and Latin are
> not uncommon). Beyond that on a subtler level, spellings like
> 'ough' in English don't follow the standard rules, but would
> leave a fingerprint by being a reasonably large set of a
> 'standard group of English spelling anomalies'. I think we
> might be able to get a fingerprint saying something to the
I don't think the entropy would be the right measure
here. Two different languages could have the same
proportions of combinations like qu- (u almost always
follows q, so the u here gives very little information)
and the same h1-h2.
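
To make the point concrete, here is a minimal sketch of how h1 and h2
might be computed. It assumes that h1 means the entropy of single
characters and h2 the conditional entropy of a character given its
predecessor; that reading of the list's shorthand, and the function
names, are assumptions rather than anything defined in this thread.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a Counter of observed items."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def h1_h2(text):
    """Return (h1, h2) for a string with at least two characters.

    h1: entropy of single characters.
    h2: conditional entropy of a character given the one before it,
        estimated as H(pairs) - H(first characters of the pairs).
    """
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    h1 = entropy(Counter(text))
    h2 = entropy(pairs) - entropy(firsts)
    return h1, h2

h1, h2 = h1_h2("ququququ")
print(h1, h2)
```

In the toy string "ququququ" each letter is fully determined by the one
before it, so h1 is 1 bit while h2 is 0. Two texts in different
languages can still share the same pair of numbers, which is why these
figures alone would not single out a foreign word.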
You could formulate rules for what constitutes a valid
syllable in a given language (or a valid word) and see
whether the language in question follows it.
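
As a rough illustration of such a rule, the sketch below encodes the
SCRVVN template quoted above as a regular expression. The template is
Jorge's, but the letter sets chosen here for S, C, R, V and N are
assumptions made only so that the quoted Latin syllables fit; they are
not his lists, and the helper names are hypothetical.

```python
import re

# Assumed letter classes, NOT taken from the thread: picked only so that
# the quoted Genesis syllables fit the SCRVVN template.
S = "s"
C = "bcdfghjklmnpqrstvxz"
R = "lr"
V = "aeiouy"
N = "mnrst"

# S, C, R and N are optional; one or two vowels are required.
SYLLABLE = re.compile(f"[{S}]?[{C}]?[{R}]?[{V}]{{1,2}}[{N}]?")

def is_valid_syllable(s):
    """True if the whole string fits the assumed SCRVVN template."""
    return SYLLABLE.fullmatch(s.lower()) is not None

def invalid_syllables(syllables):
    """Return the syllables that break the template, in input order."""
    return [s for s in syllables if not is_valid_syllable(s)]

sample = "in prin ci pio cre a vit de us cae lum et te rram".split()
print(invalid_syllables(sample))
print(invalid_syllables(["te", "strk", "qetz"]))
```

Anything returned by invalid_syllables would be a candidate for a
foreign or anomalous form, subject to how well the assumed letter
classes match the language in question.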
(But English is the very worst thing to use as an
example here. English borrows words very widely from
other languages, but, rather than change their spelling
to fit core English phonetics, it keeps the original
language's spelling and thereby often causes English
speakers to mispronounce them. My college French Prof.
couldn't stand the way Americans pronounce "reveille"
as "REH-vuh-lee" instead of "ruh-vey-YEY".)
(Even further afield. Can anyone tell me what the
Polish word "potrzebie" means? English-speakers
pronounce this word, used in the adolescent humor
magazine MAD, "pah-tur-ZEE-bee", but the Poles
pronounce it "poh-CHEB-yeh". I thought it meant
"aspirin" but the Poles say "aspirin" too.)
Back to the topic at hand. Another, easier way would
be to compare the two languages' single letter and
digraph frequencies. The chi^2 and phi^2 tests would
tell you whether any difference is statistically
significant.
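
A minimal sketch of the chi^2 part of that comparison follows, treating
the two texts' letter (or digraph) counts as a 2 x k contingency table.
The function names are hypothetical, and the sketch stops at the
statistic and its degrees of freedom; turning that into a significance
level needs a chi^2 table or a statistics package.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count letter n-grams (n=1 letters, n=2 digraphs).

    Non-letters are dropped first, so digraphs here run across word
    boundaries; that is a simplification of this sketch.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter("".join(letters[i:i + n])
                   for i in range(len(letters) - n + 1))

def chi2_stat(a, b):
    """Pearson chi^2 statistic and degrees of freedom for two Counters."""
    n_a, n_b = sum(a.values()), sum(b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    n = n_a + n_b
    keys = set(a) | set(b)
    chi2 = 0.0
    for key in keys:
        col = a[key] + b[key]
        for observed, row in ((a[key], n_a), (b[key], n_b)):
            expected = row * col / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(keys) - 1

print(chi2_stat(ngram_counts("abab"), ngram_counts("baba")))
print(chi2_stat(ngram_counts("aaaa"), ngram_counts("bbbb")))
```

Two samples with identical proportions give a statistic of 0; the
larger the statistic for a given number of degrees of freedom, the less
plausible it is that both samples come from the same distribution.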
Question. The chi^2 and phi^2 tests tell you whether
a difference is significant or not. It seems to me
that if we want to measure the *magnitude* of the
difference between two different languages, we should
use the *confidence level at which the chi^2 or phi^2
test finds the difference significant*. Is this correct?
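
One caveat, offered as a sketch rather than a settled answer: the chi^2
statistic, and with it the confidence level, grows with the amount of
text even when the proportions stay the same. The example below assumes
that phi^2 means chi^2 divided by the total count N, the usual
statistical effect-size definition; if the cryptologic phi test is what
is meant, this does not apply. The function names are hypothetical.

```python
from collections import Counter

def chi2(a, b):
    """Pearson chi^2 statistic for two non-empty Counters (2 x k table)."""
    n_a, n_b = sum(a.values()), sum(b.values())
    n = n_a + n_b
    total = 0.0
    for key in set(a) | set(b):
        col = a[key] + b[key]
        for observed, row in ((a[key], n_a), (b[key], n_b)):
            expected = row * col / n
            total += (observed - expected) ** 2 / expected
    return total

def phi2(a, b):
    """Assumed definition: chi^2 divided by the total number of items."""
    return chi2(a, b) / (sum(a.values()) + sum(b.values()))

# Same letter proportions, ten times as much text.
small_a, small_b = Counter("aaab"), Counter("abbb")
big_a, big_b = Counter("aaab" * 10), Counter("abbb" * 10)

print(chi2(small_a, small_b), chi2(big_a, big_b))
print(phi2(small_a, small_b), phi2(big_a, big_b))
```

Here chi^2 rises from 2 to 20 as the samples grow tenfold, while phi^2
stays at 0.25 in both cases. On that reading, a size-normalized figure
such as phi^2 may be a better gauge of the magnitude of a difference
than the significance level itself.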
If we knew that Voynichese writing isn't homophonic,
we could simply compare the number of syllables in a
given language with the number of tokens in Voynichese
and thus identify good candidates for the underlying
language of Voynichese.
But Voynichese writing may be homophonic. I've always
assumed this is the reason for the difference between A
and B. However, more recent studies seem to show that
different writing styles may create the difference seen
between A and B. I believe Prescott Currier discovered
the difference between A and B. I just quickly looked
through his paper and didn't see how he established
this; he said he hadn't made slides for those
statistics.
We need a measure of the difference between A and B.
Earlier I proposed the level of significance of the
difference using chi^2 or phi^2. If we see that the
difference between A and B is no greater than the
difference between two different writers' styles in a
given language, it would be reasonably safe to assume
that Voynichese is not homophonic, and we could proceed
to compare the number of Voynichese tokens with the
number of syllables in candidate languages for
Voynichese.
Of course, spelling was much less consistent for any
European language during the Middle Ages than it is
today, and that gives a homophonic effect in any case.
Further thought encouraged!!
Dennis