Re: VMs: Re: Moot points, getting long
On Thursday 05 August 2004 10:48, John Grove wrote:
> However, one does often use EVA rather than other transliterations and
> provide 'skewed' results:
> For example...
> ee 5395
> ii 4769
> hh 84
No, not skewed at all. That table shows one of the anomalies of the vms corpus
as captured in eva. I guess what was meant is "skewed in comparison to what
it should be". But what should it be? Isn't that imposing a property
without knowing whether it holds?
Note that <e> and <i> occur on their own in words (<e> even appears in one of
the key-like sequences) while <ee> and <ii> (or even <iin>) don't appear as
single characters (i.e. not in any of the key-like sequences; I know, absence
of evidence is not evidence of absence ;-) ).
It may well be that some <ee> and <ii> are single characters, but perhaps not
all of them? Which ones should be agglomerated?
As I pointed out before, how does one resolve 3 in a row? Deciding according
to the "next character" is indeed a neat idea, but how do we know that it is a
good assumption?
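To make the "3 in a row" problem concrete, here is a minimal sketch in Python.
The function name merge_pairs, the placeholder symbol "E" and the sample
string are all made up for illustration; this is not any published
transliteration rule, only a demonstration that the scanning direction
changes the result.

```python
def merge_pairs(word, pair="ee", merged="E", from_left=True):
    """Replace non-overlapping occurrences of `pair` with the single symbol `merged`.

    `merged` is a hypothetical one-character placeholder for the agglomerated glyph.
    """
    if from_left:
        # str.replace scans left to right, so "eee" becomes "Ee".
        return word.replace(pair, merged)
    # Scan right to left: reverse, replace the reversed pair, reverse back.
    return word[::-1].replace(pair[::-1], merged)[::-1]


word = "cheeedy"  # illustrative string with three <e> in a row
print(merge_pairs(word, from_left=True))   # chEedy
print(merge_pairs(word, from_left=False))  # cheEdy
```

Both outputs are equally defensible readings of the same strokes, which is
exactly the assumption that would need justifying.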
And more importantly, what is gained by representing <ee> as a new
character?
I am not trying to criticise the attempts to find a better representation,
since I also tried this some time ago by modifying Rene's curva alphabet
(more below). But at the same time I do not think that it is clear to us (or
at least to me!) what such a representation should achieve (or at least this
has not been made explicit).
Is it to represent what you see?
Is it to maximise the entropy in the hope that there is some kind of
substitution cipher?
Or to match duplet/character counts to something similar to a known language?
I must say that I am not too keen on exchanging an arbitrary representation
(eva) that makes few assumptions for another arbitrary representation that
makes more assumptions without any obvious advantage.
Maybe <ch> is too common to be a duplet (it is indeed the most common eva
duplet); however, <he>, <dy> and <ai> are also quite common (8262, 6952 and
6793 respectively). Should these also be single characters? Where does one
draw the line?
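For anyone who wants to look at such counts themselves, a minimal sketch of a
within-word duplet counter follows. The word list is a tiny made-up sample in
eva-like spelling, not the real transcription, and the name duplet_counts is
my own; the counts it prints say nothing about the actual corpus.

```python
from collections import Counter


def duplet_counts(words):
    """Count adjacent character pairs inside each word (never across word breaks)."""
    counts = Counter()
    for w in words:
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


# Illustrative sample only; replace with the words of a real transcription file.
sample = ["chedy", "daiin", "shedy", "chol"]
counts = duplet_counts(sample)

for duplet, n in counts.most_common(5):
    print(duplet, n)
```

Any cut-off applied to the resulting ranking ("agglomerate everything above N")
is a free parameter, which is the point of the question above.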
Furthermore, as Elmar noted recently, treating sequences of <e>, <i> and <ch>
as single rather than double characters does not explain any of the unknowns
any better:
1. word-frequency statistics remain exactly the same, and
2. character agglomeration does not increase the entropy to any values near
those of natural languages.
The asymptotic entropy values in gava and curva (which agglomerate complex
gallows, <ch>, <sh>, multiple <e>s and <i>s) are virtually the same as in the
Currier alphabet: the low entropy is not only due to the representation of
<iin> or <ee> but to many other common duplets, including word-starting and
word-ending characters. So there is still no chance of a character
substitution, as the entropy still remains low.
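Both points can be checked on any transcription with a few lines of Python.
What follows is only a sketch under stated assumptions: the agglomeration
table is a hypothetical stand-in (it is not the gava or curva alphabet), the
word list is made up, and the function computes the entropy of a character
given the previous one, which is a simpler quantity than the asymptotic
values mentioned above.

```python
import math
from collections import Counter


def conditional_entropy(words):
    """H(next char | previous char) in bits, using pairs taken within words."""
    pairs = Counter()
    firsts = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    total = sum(pairs.values())
    return -sum(n / total * math.log2(n / firsts[a])
                for (a, _), n in pairs.items())


def agglomerate(word, table):
    """Apply each pair -> single-symbol replacement, scanning left to right."""
    for pair, symbol in table.items():
        word = word.replace(pair, symbol)
    return word


# Hypothetical mapping for illustration only.
table = {"ch": "C", "ee": "E", "ii": "I"}
raw = ["chedy", "daiin", "cheedy", "daiin", "qokeedy", "chol"]  # made-up sample
merged = [agglomerate(w, table) for w in raw]

# Point 1: the replacement is reversible, so word frequencies cannot change.
print(sorted(Counter(raw).values()) == sorted(Counter(merged).values()))

# Point 2: compare the conditional entropies of the two representations.
print(conditional_entropy(raw), conditional_entropy(merged))
```

On a toy list like this the two entropy figures mean nothing; the comparison
only becomes informative when run over a full transcription.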
Let's not forget about known languages, where strange things also happen.
For instance, in Spanish the letter "q" is *always* followed by "u" and then
only by "i" or "e".
If one does duplet counts, in the "q*" group you get only 1 entry for "qu".
And if triplets are counted, one gets just 2 entries: "qui" and "que". The
second can also appear as 2 words in its own right (with and without accent).
Following hunches, a non-Spanish-aware character-cruncher would be very
tempted to say that "qui" and "que" are single characters.
I am sure that the Real Academia Española would not be very impressed :-)
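The Spanish "q" counts are easy to reproduce. Here is a minimal sketch; the
helper name ngrams_starting_with is my own and the sentence is just a handful
of common Spanish words strung together, so treat it as an illustration
rather than a corpus study.

```python
from collections import Counter


def ngrams_starting_with(text, prefix, n):
    """Count within-word n-grams of length n that begin with `prefix`."""
    counts = Counter()
    for w in text.lower().split():
        for i in range(len(w) - n + 1):
            gram = w[i:i + n]
            if gram.startswith(prefix):
                counts[gram] += 1
    return counts


sample = "quien quiere queso que quiso aquel quince"  # illustrative words only

print(ngrams_starting_with(sample, "q", 2))  # a single entry: "qu"
print(ngrams_starting_with(sample, "q", 3))  # two entries: "qui" and "que"
```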
Cheers,
Gabriel