Re: VMs: f85r2 "four ages" diagram ( word boundaries)

--- knoxmix@xxxxxxxxxxxxx wrote:
> Hello Eric,
> Re. collocation analysis. It appears to me there are
> fewer phrases or 
> strings of words than a non-linguistic chance
> assortment of words 
> with the same array of frequencies would produce. (I
> do not have a 
> word for such a condition). Can you test for that
> with spaces intact? 
> Using letter strings only? 
> What is the definition of  "minimum description
> length"?
> KM

Yes, that's essentially the idea behind a collocation
- that the two items next to each other mean more than
they do apart. "Tennis Court" means a very specific
idea, separate from the words "tennis" and "court" on
their own.

Using a likelihood ratio to find collocations can be
done on either letter-letter or word-word combinations
(where you would take the spaces to indicate the word
boundary). The problem is that just about everything in
language is "on purpose" and statistically significant.
So "and the" is significant, along with "tennis court"
and just about everything else that appears in

However, I haven't tried collocation likelihood
measurements on gibberish or heavily encrypted text
(by "heavily" I mean more than a simple substitution
cipher, which would behave the same as plain text for
collocations). I would guess that in these cases the
number of significant collocations would drop, but by
how much I don't know. I did run collocation
likelihoods once on the VMS text (see my long message
about concentrating on known languages) and didn't see
any anomalies (the
character combinations we always see together - 4o -
show up).
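
For anyone who wants to experiment, here is a minimal sketch of
likelihood-ratio scoring for adjacent pairs. It assumes Dunning's
log-likelihood ratio (G^2) as the statistic, which may not be the exact
formula behind the runs mentioned above. The names llr and
score_bigrams are made up for illustration. The same function handles
word-word pairs (split on spaces) or letter-letter pairs (pass a list
of characters).

```python
import math
from collections import Counter


def llr(c12, c1, c2, n):
    """Log-likelihood ratio (G^2) for one adjacent pair.

    c12: count of the pair (a, b)
    c1:  count of pairs whose first item is a
    c2:  count of pairs whose second item is b
    n:   total number of adjacent pairs
    """
    # 2x2 contingency table: observed counts and counts expected
    # if the two positions were independent.
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with an observed count of zero contribute nothing.
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def score_bigrams(items):
    """Return {(a, b): G^2} for every adjacent pair in a sequence."""
    pairs = list(zip(items, items[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return {
        (a, b): llr(c, first[a], second[b], n)
        for (a, b), c in pair_counts.items()
    }


if __name__ == "__main__":
    text = "tennis court a b tennis court b a tennis court a a tennis court b b"
    word_scores = score_bigrams(text.split())            # spaces as boundaries
    letter_scores = score_bigrams(list(text.replace(" ", "")))  # letters only
    for pair, g2 in sorted(word_scores.items(), key=lambda kv: -kv[1])[:3]:
        print(pair, round(g2, 2))
```

To compare against a "chance assortment with the same frequencies",
one option is to shuffle the item list (random.shuffle) and score it
the same way, then compare how many pairs exceed a chosen cutoff in
each version. For one degree of freedom, 10.83 is the usual chi-square
cutoff at p = 0.001, but with this many pairs tested at once that
cutoff is only a rough guide.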

Minimum description length -- I just posted in my long

If you have any ideas you would like to try out, let
me know!
