Re: VMs: Worry - information loss in transcription - pictures ...
--- Nick Pelling <incoming@xxxxxxxxxxxxxxxxx> wrote:
> > > [Nick:]
> > > * Entropy of EVA = 221899 x 4.0 = 887596.00 bits
> > > * Entropy of simple glyphs (+ ee) = 198098 x 4.08 = 808239.84 bits
> > > * Entropy of pair transcription (+ ee) = 155349 x 4.36 = 677321.64 bits
> > [Rene:]
> >What's the 4.0 mean? And what about the 4.08?
> 4.0 / 4.08 / 4.36 are the h1 values (ie, the average number of bits
> per token) for each transcription.
I wondered.... It's a bit on the high side. For a
normal English/Latin text it would be about that,
but for the VMs it's about 3.8. The difference
between transcriptions in Eva and Currier is only
about 2%. I took that from a paper written by
> So, multiplying that figure by the number of
> token instances gives the (context-free) total size
> (in bits) of each transcription.
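As a rough illustration of the calculation being described (h1 times the number of token instances), here is a minimal Python sketch. The function names `h1` and `total_bits` are my own labels, and the input is assumed to be a transcription already split into a list of tokens; it is not the code used to produce the figures above.

```python
import math
from collections import Counter

def h1(tokens):
    """First-order (single-token) entropy in bits per token."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def total_bits(tokens):
    """Context-free size: number of token instances times h1."""
    return len(tokens) * h1(tokens)

# Toy input, not Voynich data: two equally frequent tokens.
sample = list("abab")
print(h1(sample), total_bits(sample))
```

With the figures quoted above, the same product is simply 221899 x 4.0 for EVA, and so on for the other two transcriptions.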
Then the above figures essentially show that the
text can be compressed into a smaller one without
loss of information. The 'trick' of progressively
replacing pairs with new single characters to 'see
what happens' was tried a couple of years ago
by Jim Reeds, and some results are available at
Stolfi's Web site ('where are the bits'; I can't
currently get on his site, so can't provide you
with the correct link).
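To make the pair-replacement idea concrete, here is a small Python sketch of one merge step: find the most frequent adjacent pair, replace it with a single new token, and compare the context-free size before and after. This is only my own illustration of the general idea, not a reconstruction of the actual procedure used in the earlier experiments; the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
import math
from collections import Counter

def h1(tokens):
    """First-order entropy in bits per token."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` (left to right)
    with one combined token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy input, not Voynich data.
text = list("abababab")
merged = merge_pair(text, most_frequent_pair(text))
print(len(text) * h1(text), len(merged) * h1(merged))
```

The merge is reversible as long as the combined token cannot be confused with an existing one, so a drop in the context-free size reflects structure that h1 alone does not see.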
> >You're looking at single-character entropy, which
> >is a bit on the low side for the VMs, but it's
> >the pair entropy (or the conditional single-
> >character entropy) which is really anomalous.
> That's next on my list... :-)
I see. I won't say that you're barking up the
wrong tree, but there definitely is a bigger fish
in another tree :-)
> > > And isn't it strange how <o> and <y> are so
> > > yet so very rarely
> > > occur beside each other? Glyph transcription + ee + oy + yo ==>
> > > (oy = 0.07% and yo = 0.05%).
> > This is precisely the origin of the low pair
> > entropy.
> I wouldn't say these *are* the origin of
> the low pair entropy so much as they *point
> towards* the origin of it - but
> it'll take a bit of work to figure out what that
> origin is...
That is of course correct. Dennis also wondered.
I mean: the fact that lots of character combinations
which would be expected to be more or less
frequent are in reality rare is what causes the
low second-order entropy we observe. There is
indeed (probably) a reason why such combinations
are 'forbidden'. This has also been observed in
some detail by Stolfi, but is not yet
fully explained. We see something but don't yet
understand why we see it.
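For anyone wanting to check the effect on their own transcription, here is a minimal Python sketch of the conditional single-character entropy, computed as the pair entropy minus the entropy of the first character of each pair. The function names are my own, and the two toy strings are made up to show the contrast: one where every pair occurs, and one where half of the pairs are 'forbidden'.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a Counter of observed frequencies."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(tokens):
    """H(next | current) = H(pair) - H(first element of pair)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    return entropy(pairs) - entropy(firsts)

# Toy inputs, not Voynich data.
free = list("aabba")       # pairs aa, ab, bb, ba all occur
strict = list("abababab")  # aa and bb never occur
print(conditional_entropy(free), conditional_entropy(strict))
```

Both toy strings have a single-character entropy close to 1 bit, but the second has a conditional entropy of zero, because each character fully determines the next one.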