Re: VMs: Worry - information loss in transcription - pictures ...



Hi Gabriel,

Funnily enough, in these agglomerated alphabets many statistics remain similar
or even the same.
The low entropy, for instance, is still noticeable (anyone who does not believe
it can just check the counting commas pages).

Some figures (token count x bits per token = total bits):-


* Entropy of EVA = 221899 x 4.0 = 887596.00 bits
* Entropy of simple glyphs (+ ee) = 198098 x 4.08 = 808239.84 bits
* Entropy of pair transcription (+ ee) = 155349 x 4.36 = 677321.64 bits

Even though the pair transcription is only 78.4% as long as the simple glyph transcription (155349 vs 198098 tokens), its total entropy is still 83.8% of the latter's (677321.64 vs 808239.84 bits).
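
To make the arithmetic above easy to re-check, here is a minimal Python sketch. It assumes the bits-per-token values are first-order (single-token) Shannon entropies, which the message does not state explicitly; the function names are made up for illustration, and the token counts and entropy values are simply those quoted above.

```python
import math
from collections import Counter

def entropy_bits_per_token(tokens):
    """First-order Shannon entropy (bits per token) of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def total_bits(tokens):
    """Token count multiplied by bits per token, as in the figures above."""
    return len(tokens) * entropy_bits_per_token(tokens)

# Figures quoted in the message (not recomputed from any transcription file).
glyph_len, glyph_h = 198098, 4.08   # simple glyphs (+ ee)
pair_len, pair_h = 155349, 4.36     # pair transcription (+ ee)

length_ratio = pair_len / glyph_len
bits_ratio = (pair_len * pair_h) / (glyph_len * glyph_h)

print(f"length ratio: {length_ratio:.1%}")
print(f"total-bits ratio: {bits_ratio:.1%}")
```

The two ratios differ because merging glyphs into pairs shortens the text but enlarges the alphabet, so each remaining token carries slightly more information.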

So, what's happening here?
(1) All figures are being "diluted" by the large number of spaces - a space occurs every 6.4 tokens (glyphs) or every 5 tokens (pairified).
(2) Spaces aside, 5 common tokens still make up 27% of the pairified transcription, and it seems to be these which keep the entropy relatively low:-
(o 7.21%) [even with all the common or/ol/ot/of/etc pairs removed!]
(ch 6.86%)
(y 6.7%)
(e 6.58%)
(k 6.28%)
The problem here is that there's a noticeable drop in instance count after these five, which seems artificial.


So, another plausible pairification would be my basic pair transcription but with these extra pairs:-
ee / ke / kee / te / tee / che / chee / she / shee
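
One way such a pairification could be applied to an EVA string is greedy longest-match tokenisation, sketched below in Python. This is only an illustration: the basic pair list itself is not given in this message, so the sketch uses just the extra groups listed above, plus <ch> and <sh> as single glyphs (an assumption). The name `tokenize` and the sample word are hypothetical.

```python
# Extra groups listed in the message.
EXTRA_GROUPS = ["ee", "ke", "kee", "te", "tee", "che", "chee", "she", "shee"]
# Assumption: <ch> and <sh> are treated as single glyphs.
BASE_GLYPHS = ["ch", "sh"]

def tokenize(text, groups):
    """Split text into tokens, always preferring the longest matching group."""
    ordered = sorted(groups, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for g in ordered:
            if text.startswith(g, i):
                tokens.append(g)
                i += len(g)
                break
        else:
            # No group matches here: emit the single character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

sample = "qokeedy.chedy"  # illustrative EVA-style string, '.' being a space
print(tokenize(sample, EXTRA_GROUPS + BASE_GLYPHS))
```

Longest-match is only one possible rule; a different precedence (e.g. pairing <ok> before <kee>) would give different token counts and so different entropy figures.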


* Entropy of pairs + [k/t/ch/sh][e][e] = 141167 x 4.6 = 649368.2 bits

For this, the most common tokens are:-
(. 21.91%) [i.e. the space]
(y 7.37%)
(o 7.08%)
(dy 4.74%)
(k 4.21%)
(d 4.16%)
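
A ranking like the one above can be produced from any token list with a few lines of Python. This is a minimal sketch, not the tool actually used for these figures; `top_shares` and the toy token list are hypothetical.

```python
from collections import Counter

def top_shares(tokens, n=6):
    """Return the n most common tokens with their share of all tokens, in percent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(tok, round(100.0 * c / total, 2)) for tok, c in counts.most_common(n)]

# Toy data only, to show the output shape.
toy = ["."] * 4 + ["y"] * 3 + ["o"] * 2 + ["d"]
for tok, pct in top_shares(toy, n=3):
    print(f"({tok} {pct}%)")
```

Whether spaces are counted in the total changes every percentage, so it is worth stating which convention a given table uses.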

However, I'm not 100% convinced here either. These frequencies look neither curvy enough for monoalpha (i.e. natural language) stats nor flat enough for polyalpha stats - nor do they look like any other obvious kind of distribution. There are no obvious cycle lengths, and spaces seem roughly 1.5x more frequent (relative to the rest of the text) than I'd expect.

So, yes - even in agglomerated alphabets, entropy is low... but I think it's being kept low by the large number of spaces and a small handful of frequent symbols (most specifically <o>, <y> and <dy>), which are being used in a non-obvious way. Any account of the VMs would need to include some idea of how these special characters function, even apart from frequent pairs.

And isn't it strange how <o> and <y> are so common, yet so very rarely occur beside each other? In the glyph transcription with ee, oy and yo added as tokens, oy = 0.07% and yo = 0.05%.
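
To put a number on "so very rarely", one can compare how often two tokens actually sit next to each other with how often they would if token order were independent of identity. The Python sketch below does that. Note that it measures adjacent-pair rates, which is not quite the same as the token shares quoted above, and `adjacency_report` is a hypothetical name.

```python
from collections import Counter

def adjacency_report(tokens, a="o", b="y"):
    """Observed rates of 'a then b' and 'b then a' among adjacent pairs,
    plus the rate expected if adjacent tokens were independent."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    counts = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_pairs = n - 1
    return {
        "observed_ab": bigrams[(a, b)] / n_pairs,
        "observed_ba": bigrams[(b, a)] / n_pairs,
        "expected": (counts[a] / n) * (counts[b] / n),
    }

# Toy data only, to show the output shape.
print(adjacency_report(list("oyoy")))
```

As a rough guide only (the 7.08% and 7.37% shares quoted earlier come from a different transcription): 0.0708 x 0.0737 is about 0.52%, several times larger than the 0.07% and 0.05% seen for oy and yo.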

Cheers, .....Nick Pelling.....

