
Re: VMs: Re: Moot points, getting long



On Thursday 05 August 2004 10:48, John Grove wrote:
> However, one does often use EVA rather than other transliterations and
> provide 'skewed' results:
> 	For example...
> ee 5395
> ii 4769
> hh 84

No, not skewed at all. That table shows one of the anomalies of the VMS corpus 
as captured in EVA. I guess what was meant is "skewed in comparison to what 
it should be". But what should it be? Isn't that imposing a property 
without knowing whether it holds?

Note that <e> and <i> occur on their own in words (<e> even appears in one of 
the key-like sequences), while <ee> and <ii> (or even <iin>) do not appear as 
single characters in any of the key-like sequences. (I know, absence of 
evidence is not evidence of absence ;-) )

It may well be that some <ee> and <ii> are single characters, but perhaps not 
all of them. Which ones should be agglomerated?
As I pointed out before, how does one resolve three in a row? Deciding 
according to the "next character" is indeed a neat idea, but how do we know 
whether it is a good assumption? 
And, more importantly, what is gained by representing <ee> as a new 
character?
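
To make the three-in-a-row problem concrete, here is a minimal Python sketch. 
The words "cheey" and "cheeey" are made-up EVA-like strings and the symbol 
"E" is a hypothetical stand-in for an agglomerated <ee>; neither comes from 
any existing alphabet. The same rule applied from the left or from the right 
agrees on two <e>s but disagrees on three:

```python
def merge_left(word, pair="ee", symbol="E"):
    """Replace each non-overlapping pair, scanning from the start of the word."""
    return word.replace(pair, symbol)


def merge_right(word, pair="ee", symbol="E"):
    """Apply the same rule, but scanning from the end of the word."""
    return word[::-1].replace(pair[::-1], symbol)[::-1]


# Illustrative strings only, not taken from a transcription.
for w in ["cheey", "cheeey"]:
    print(w, merge_left(w), merge_right(w))
```

Neither result is obviously the right one, which is the point: the choice of 
scanning direction is an extra assumption built into the new alphabet.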

I am not trying to criticise the attempts to find a better representation, 
since I also tried this some time ago by modifying Rene's curva alphabet 
(more below). But at the same time I do not think that we (or at least I!) 
are clear about what such a representation should achieve (or at least this 
has not been made explicit).

Is it to represent what you see? 
Is it to maximise the entropy, in the hope that there is some kind of 
substitution cipher?
Or to match duplet/character counts to something similar to a known language?

I must say that I am not too keen on exchanging an arbitrary representation 
(EVA) that makes few assumptions for another arbitrary representation that 
makes more assumptions without any obvious advantage.

Maybe <ch> is too common to be a duplet (it is indeed the most common EVA 
duplet); however, <he>, <dy> and <ai> are also quite common (8262, 6952 and 
6793 respectively). Should these also be single characters? Where does one 
draw the line? 
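
For anyone who wants to reproduce this kind of count, a minimal Python sketch 
of a within-word duplet counter follows. The word list is a tiny illustrative 
sample, not a transcription, and the function name digraph_counts is my own; 
the figures quoted above come from the full EVA corpus, not from this code.

```python
from collections import Counter


def digraph_counts(words):
    """Count adjacent character pairs inside each word (never across words)."""
    counts = Counter()
    for w in words:
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


# Tiny illustrative sample; replace with words read from a transcription file.
sample = ["chedy", "qokeedy", "daiin", "shedy", "chey"]
counts = digraph_counts(sample)
print(counts.most_common(5))
```

The counter ranks the duplets but says nothing about where to put the 
threshold; any cut-off for "too common to be two characters" has to be 
supplied from outside.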

Furthermore, as Elmar noted recently, treating <e>, <i> and <ch> sequences 
as single or double characters does not explain any of the unknowns any 
better:
1. word-frequency statistics remain exactly the same, and 

2. character agglomeration does not increase the entropy to values anywhere 
near those of natural languages. 
The asymptotic entropy values in gava and curva (which agglomerate complex 
gallows, <ch>, <sh>, multiple <e>s and <i>s) are virtually the same as in 
the Currier alphabet: the low entropy is due not only to the representation 
of <iin> or <ee> but also to many other common duplets, including 
word-starting and word-ending characters. So there is still no chance of a 
character substitution, as the entropy remains low.
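
Both points can be checked on any word list with a short Python sketch like 
the one below. It is only a sketch under assumptions: the sample words are 
illustrative, the mapping to "C", "S", "E" and "I" is a hypothetical 
agglomeration (it is not gava or curva), and the entropies are plain 
first-order and conditional values, not the asymptotic ones mentioned above.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of a Counter of observations."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def char_entropies(words):
    """Return (h1, h2): single-character entropy, and the conditional
    entropy of a character given the previous one within a word."""
    singles = Counter(c for w in words for c in w)
    pairs = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    firsts = Counter(w[i] for w in words for i in range(len(w) - 1))
    # H(next | previous) = H(previous, next) - H(previous)
    return entropy(singles), entropy(pairs) - entropy(firsts)


def agglomerate(word):
    """Hypothetical re-encoding: one new symbol per common group."""
    for group, symbol in (("ch", "C"), ("sh", "S"), ("ee", "E"), ("ii", "I")):
        word = word.replace(group, symbol)
    return word


# Illustrative sample only; use a real transcription for meaningful numbers.
words = "daiin chedy qokeedy shedy daiin chey okaiin chedy".split()
merged = [agglomerate(w) for w in words]

# Point 1: the re-encoding is reversible, so word frequencies are unchanged.
same_word_stats = (sorted(Counter(words).values())
                   == sorted(Counter(merged).values()))
print("word frequencies unchanged:", same_word_stats)

# Point 2: compare character entropies before and after agglomeration.
print("original h1, h2:", char_entropies(words))
print("merged   h1, h2:", char_entropies(merged))
```

On a sample this small the entropy figures mean nothing by themselves; the 
sketch is only meant to show what is being compared.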

Let's not forget about known languages, where strange things also happen.
For instance, in Spanish the letter "q" is *always* followed by "u" and then 
only by "i" or "e".
If one does duplet counts, the "q*" group has only one entry: "qu". 
And if triplets are counted, one gets just two entries: "qui" and "que". The 
second can also appear as two words (with and without an accent).
Following hunches, a non-Spanish-aware character-cruncher would be very 
tempted to say that "qui" and "que" are single characters.
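
The Spanish case is easy to try out. The Python sketch below uses a made-up 
sample sentence with the accents stripped (so "quizas" and "aqui" stand in 
for the accented spellings); a real test would of course need a proper 
corpus.

```python
from collections import Counter

# Made-up sample sentence, accents removed for simplicity.
text = "quien quiere queso porque quizas aqui queda poco"
words = text.split()

# Duplets and triplets that start with "q", counted inside words only.
q_duplets = Counter(w[i:i + 2] for w in words
                    for i in range(len(w) - 1) if w[i] == "q")
q_triplets = Counter(w[i:i + 3] for w in words
                     for i in range(len(w) - 2) if w[i] == "q")

print(q_duplets)
print(q_triplets)
```

A cruncher that merges any group with so few continuations would happily 
turn "qui" and "que" into two new letters.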

I am sure that the Real Academia Española would not be very impressed :-)

Cheers,

Gabriel
