
Re: VMs: Worry - information loss in transcription - pictures ...



Hi Rene,

> Your E-mail is potentially interesting, but I can't quite follow it.

> > * Entropy of EVA = 221899 x 4.0 = 887596.00 bits
> > * Entropy of simple glyphs (+ ee) = 198098 x 4.08 = 808239.84 bits
> > * Entropy of pair transcription (+ ee) = 155349 x 4.36 = 677321.64 bits

> What's the 4.0 mean? And what about the 4.08?

4.0 / 4.08 / 4.36 are the h1 values (ie, the average number of bits per token) for each transcription. So, multiplying that figure by the number of token instances gives the (context-free) total size (in bits) of each transcription. Because changing the transcription changes the token count, it's important here to make the comparison in absolute terms (ie, total number of bits) rather than in relative terms (ie, bits per token).
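
Here's a rough Python sketch of that calculation, purely for illustration - the sample token stream below is made up, not taken from any actual transcription:

    from collections import Counter
    from math import log2

    def h1(tokens):
        # Order-0 (single-token) entropy: average bits per token.
        counts = Counter(tokens)
        total = len(tokens)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    # Purely hypothetical token stream standing in for a transcription:
    transcription = list("qokeedyqokaindaiin")
    bits_per_token = h1(transcription)
    total_bits = len(transcription) * bits_per_token
    print(f"{len(transcription)} tokens x {bits_per_token:.2f} bits/token "
          f"= {total_bits:.2f} bits")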


> You're looking at single-character entropy, which is a bit on the low
> side for the VMs, but it's the pair entropy (or the conditional
> single-character entropy) which is really anomalous.

That's next on my list... :-)
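
(For anyone who wants to play along: the conditional single-character entropy is the entropy of the adjacent-pair distribution minus h1. A minimal sketch of my own, again with a made-up sample string:)

    from collections import Counter
    from math import log2

    def entropy(counts):
        # Shannon entropy (in bits) of a frequency table.
        total = sum(counts.values())
        return -sum((n / total) * log2(n / total) for n in counts.values())

    text = "qokeedyqokaindaiin"  # hypothetical sample
    h1 = entropy(Counter(text))                     # single-character entropy
    h_pair = entropy(Counter(zip(text, text[1:])))  # adjacent-pair entropy
    print(f"h1 = {h1:.2f}, pair = {h_pair:.2f}, "
          f"conditional = {h_pair - h1:.2f}")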


> > And isn't it strange how <o> and <y> are so common, yet so very
> > rarely occur beside each other? Glyph transcription + ee + oy + yo
> > ==> (oy = 0.07% and yo = 0.05%).

> This is precisely the origin of the low pair entropy.

I'm comfortable with <o> acting as a kind of "shift" character (because of or/ol/ok/ot etc), even though that still fails to explain a large percentage of occurrences of <o>; I'm not quite so comfortable positing the same thing for <y>. I wouldn't say these rare pairs *are* the origin of the low pair entropy so much as they *point towards* it - but it'll take a bit of work to figure out what that origin actually is...
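
(Those oy/yo percentages are just digraph frequencies - occurrences of the pair divided by the total number of adjacent pairs. A quick sketch for checking any pairs against any transcription string; the sample here is made up:)

    from collections import Counter

    def digraph_pct(text, pairs=("oy", "yo")):
        # Percentage frequency of selected adjacent pairs in a string.
        counts = Counter(a + b for a, b in zip(text, text[1:]))
        total = max(len(text) - 1, 1)
        return {p: 100.0 * counts[p] / total for p in pairs}

    print(digraph_pct("qokeedyqokaindaiinoydy"))  # hypothetical sample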


Cheers, .....Nick Pelling.....

