Re: VMs: word boundaries
During my very optimistic beginnings :) I was hoping
that the collocation analysis would resolve not only
where the word boundaries were, but also what the
letters were (starting with EVA as the most
fine-grained transcription). I quickly ran into the
"th" issue in English: the vast majority of the time
"th" should be treated as a single letter in English,
but not always... and that "not always" is the killer.
I think I'll run a collocation on the letters in EVA
and post the results, just to see what sort of
insights others might have - I might be missing
something really obvious.
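
A minimal sketch of what a letter-level collocation run could look like. It assumes pointwise mutual information (PMI) as a stand-in score, which is not necessarily the likelihood measure used for the earlier results, and it assumes the transcription has already been split into words of single-character glyphs. The function name letter_pair_pmi and the min_count cutoff are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def letter_pair_pmi(words, min_count=2):
    """Score adjacent letter pairs by pointwise mutual information.

    PMI is an assumed stand-in for a collocation score: a high value
    means the two letters occur together more often than their
    individual frequencies would predict.
    """
    singles = Counter()
    pairs = Counter()
    for w in words:
        singles.update(w)            # count each letter
        pairs.update(zip(w, w[1:]))  # count adjacent pairs inside a word
    n_single = sum(singles.values())
    n_pair = sum(pairs.values())

    scores = {}
    for (a, b), c in pairs.items():
        if c < min_count:            # skip rare pairs, their PMI is unreliable
            continue
        p_ab = c / n_pair
        p_a = singles[a] / n_single
        p_b = singles[b] / n_single
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores

# Toy English sample (made up for illustration, not real corpus data).
sample = "the then that this other with both cat dog".split()
for pair, score in sorted(letter_pair_pmi(sample).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

On English input a pair like "th" should score well above zero, which is the "single letter most of the time" effect. The score alone does not say when the pair should be split, which is exactly the "not always" problem. Relabelling the letters one-for-one leaves every score unchanged, consistent with the remark quoted below about simple substitution ciphers.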
--- Nick Pelling <incoming@xxxxxxxxxxxxxxxxx> wrote:
> Hi Eric,
> At 20:35 12/07/2004 -0700, Eric wrote:
> >However, I haven't tried collocation likelihood
> >measurements on gibberish or heavily encrypted text
> >(by that I mean, a simple substitution cipher would
> >behave the same as plain text for collocations). I
> >would guess in these cases the number of
> >collocations would drop, but by how significantly I
> >don't know. I did run collocation likelihoods once
> >the VMS text (see my long message about
> >on known languages) and didn't see any anomalies
> >character combinations we always see together - 4o
> >show up).
> If Voynichese is based in part on a verbose cipher
> (where certain digraphs
> ('diglyphs'?) like "qo" ('4o'), 'dy', 'or', 'ol',
> etc code for special
> tokens), then you might also be able to tease out
> interesting results from
> it using these kinds of analyses.
> However, the analytical problem is that the basic
> concept of "letter"
> becomes rather amorphous: for example, it seems that
> in many/most/all
> cases, Voynichese "o" has no independent meaning -
> so any analysis that
> relies on a concept of a "state" associated with
> that "letter" will be
> misleading. Take a basic pair of words like "otedy"
> and "qotedy": I
> personally have little doubt that the latter should
> probably be parsed as
> "qo-t-e-dy" (or perhaps "qo-te-dy") - but what about
> the former?
> If all "o"s are misleading (that is, if
> free-standing "o" has no meaning),
> then we should expect to parse it as "ot-e-dy" - but
> IIRC other analyses
> suggest that "tedy" is some kind of word base here,
> with "qo-" and "o-" as
> Perhaps you might consider how to apply your box of
> tricks to test this "o
> is not a real letter" hypothesis?
> Cheers, .....Nick Pelling.....
> To unsubscribe, send mail to majordomo@xxxxxxxxxxx
> with a body saying:
> unsubscribe vms-list