Re: VMs: Worry - information loss in transcription - pictures ...
Hi Gabriel,
Funnily, in these agglomerated alphabets, many statistics remain similar or
even the same.
The low entropy, for one, is still noticeable (for those who do not believe
it, just check the counting commas pages).
Some figures (total bits = token count x bits per token):-
* Entropy of EVA = 221899 x 4.0 = 887596.00 bits
* Entropy of simple glyphs (+ ee) = 198098 x 4.08 = 808239.84 bits
* Entropy of pair transcription (+ ee) = 155349 x 4.36 = 677321.64 bits
Even though the pair transcription's length is only 78.4% of the simple
glyph transcription, its entropy is 83.8% of the latter.
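As a quick check of the arithmetic above, here is a minimal Python sketch
that recomputes the two ratios from the figures quoted in this post. The
dictionary layout and the function name total_bits are mine, purely for
illustration; only the numbers come from the post.

```python
# Figures copied from the post: (token count, bits per token).
transcriptions = {
    "EVA": (221899, 4.0),
    "simple glyphs (+ ee)": (198098, 4.08),
    "pair transcription (+ ee)": (155349, 4.36),
}


def total_bits(length, bits_per_token):
    """Total information = number of tokens x entropy per token."""
    return length * bits_per_token


glyph_len, glyph_h = transcriptions["simple glyphs (+ ee)"]
pair_len, pair_h = transcriptions["pair transcription (+ ee)"]

# Pairifying shortens the text, but each token carries more bits,
# so the total drops by less than the length does.
length_ratio = pair_len / glyph_len
bits_ratio = total_bits(pair_len, pair_h) / total_bits(glyph_len, glyph_h)

print(f"length ratio: {length_ratio:.1%}")  # 78.4%
print(f"bits ratio:   {bits_ratio:.1%}")    # 83.8%
```

The gap between the two ratios is just the rise in bits per token
(4.08 to 4.36) partly offsetting the shorter length.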
So, what's happening here?
(1) All figures are being "diluted" by the large number of spaces - a space
happens every 6.4 tokens (glyphs) or every 5 tokens (pairified).
(2) Spaces aside, five common tokens still make up 27% of the pairified
transcription, and it seems to be these which keep the entropy relatively
low:-
(o 7.21%) [even with all the common or/ol/ot/of/etc pairs removed!]
(ch 6.86%)
(y 6.7%)
(e 6.58%)
(k 6.28%)
The problem here is that there's a noticeable drop in instance count after
these five, which seems artificial.
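For anyone wanting to reproduce this kind of table, the sketch below shows
one way to get per-token shares and a per-token entropy from a token list.
It assumes plain single-token (first-order) Shannon entropy, which may not
be exactly how the figures in this post were produced, and the sample
input is a made-up toy sequence, not real VMs data.

```python
from collections import Counter
from math import log2


def token_stats(tokens):
    """Return (bits per token, [(token, share), ...]) with shares descending."""
    counts = Counter(tokens)
    total = sum(counts.values())
    shares = [(tok, n / total) for tok, n in counts.most_common()]
    # First-order Shannon entropy: H = -sum(p * log2(p)).
    entropy = -sum(p * log2(p) for _, p in shares)
    return entropy, shares


# Toy input only: "." marks a space, as in the token lists in this post.
sample = ["o", "k", "e", "y", ".", "ch", "e", "dy", ".", "o", "k", "y"]
entropy, shares = token_stats(sample)

# Combined share of the five most common tokens.
top5_share = sum(p for _, p in shares[:5])

print(f"{entropy:.2f} bits per token")
for tok, p in shares[:5]:
    print(f"({tok} {p:.2%})")
```

A sharp step in the sorted shares, like the one after the five tokens
above, would show up directly in the printed list.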
So, another plausible pairification would be my basic pair transcription
but with these extra pairs:-
ee / ke / kee / te / tee / che / chee / she / shee
* Entropy of pairs + [k/t/ch/sh][e][e] = 141167 x 4.6 = 649368.2 bits
For this, the most common tokens are:-
(. 21.91%)
(y 7.37%)
(o 7.08%)
(dy 4.74%)
(k 4.21%)
(d 4.16%)
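The extra pairs listed above (ee / ke / kee / ...) could be applied with a
greedy longest-match tokeniser along the following lines. This is only a
sketch: it assumes ch and sh already count as single glyphs, it leaves out
the rest of the basic pair transcription (which is not spelled out in this
post), and the name tokenize and the test words are illustrative.

```python
# Extra multi-glyph tokens listed in the post.
EXTRA = ["ee", "ke", "kee", "te", "tee", "che", "chee", "she", "shee"]
# Assumption: ch and sh are single glyphs in the base transcription.
BASE = ["ch", "sh"]


def tokenize(word, vocab):
    """Split word into tokens, always taking the longest matching vocab entry."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for candidate in by_length:
            if word.startswith(candidate, i):
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-glyph token starts here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("qokeedy", EXTRA + BASE))  # ['q', 'o', 'kee', 'd', 'y']
print(tokenize("cheey", EXTRA + BASE))    # ['chee', 'y']
```

Longest-match matters here: without it, "kee" would be split as ke + e
rather than taken as one token.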
However, I'm not 100% convinced here either. The frequency distribution
looks neither curvy enough for monoalphabetic (i.e. natural language) stats
nor flat enough for polyalphabetic stats, nor like any other obvious kind of
distribution. There are no obvious cycle lengths, and spaces seem roughly
1.5x more frequent (relative to the rest of the text) than I'd expect.
So, yes - even in agglomerated alphabets, entropy is low... but I think
it's being kept low by the large number of spaces and a small handful of
frequent symbols (most specifically <o>, <y> and <dy>), which are being
used in a non-obvious way. Any account of the VMs would need to include
some idea of how these special characters function, even apart from
frequent pairs.
And isn't it strange how <o> and <y> are so common, yet so very rarely
occur beside each other? Glyph transcription + ee + oy + yo ==> (oy = 0.07%
and yo = 0.05%).
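One way to put a number on "very rarely" is to compare how often two
tokens actually sit side by side with how often they would if tokens were
placed independently of their neighbours. The sketch below does that for a
token list. Note the assumptions: independence is only a crude baseline,
the measure here (share of adjacent pairs) is not quite the same as the
token shares quoted above, and the function name and toy data are mine.

```python
from collections import Counter


def adjacency_report(tokens, a, b):
    """Return (observed, expected) share of adjacent (a, b) pairs.

    expected assumes each token is placed independently of its neighbour.
    """
    total = len(tokens)
    if total < 2:
        return 0.0, 0.0
    freq = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    observed = pairs[(a, b)] / (total - 1)
    expected = (freq[a] / total) * (freq[b] / total)
    return observed, expected


# Toy sequence only, not real VMs data.
toy = ["o", "k", "y", ".", "o", "l", "dy", ".", "ch", "o", "y"]
observed, expected = adjacency_report(toy, "o", "y")
print(f"o then y: observed {observed:.2%}, expected {expected:.2%}")
```

As a rough guide, two tokens at about 7% each (as <o> and <y> are in the
lists above) would be expected side by side in a given order about
0.07 x 0.07, i.e. roughly 0.5% of the time, under that baseline.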
Cheers, .....Nick Pelling.....