Re: VMs: f85r2 "four ages" diagram (word boundaries)
Hello Eric and all,
I can only do the collocations by specifying a series of words as
input. What I am looking for is a score for the VMS. Does it contain
"phrases" in number commensurate with a European language (I am sure
it does not), about equal to a random assortment of words, or less
than a random assortment? "Random" except that we know the frequency
of occurrence of the various words -- like having 20000 thoroughly
mixed rocks but only 8000 different kinds of rocks. Labels should be
omitted. The first six quires should be sufficient in length for the
test.
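A minimal sketch of the comparison described above, under stated assumptions: a "phrase" is taken to be any adjacent word pair that occurs at least twice, and the "random assortment" is the same word list shuffled, so every word keeps its frequency. The function names, the pair-based definition of a phrase, and the number of shuffles are all illustrative choices, not an established VMS measure.

```python
import random
from collections import Counter

def repeated_bigram_count(words):
    """Count distinct adjacent word pairs that occur at least twice."""
    pairs = Counter(zip(words, words[1:]))
    return sum(1 for n in pairs.values() if n >= 2)

def phrase_score(words, trials=200, seed=0):
    """Compare repeated pairs in the text with a shuffled baseline.

    Shuffling keeps every word's frequency but destroys word order.
    Returns (observed, baseline_mean, ratio).
    """
    rng = random.Random(seed)
    observed = repeated_bigram_count(words)
    shuffled = list(words)
    total = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        total += repeated_bigram_count(shuffled)
    mean = total / trials
    ratio = observed / mean if mean else float("inf")
    return observed, mean, ratio
```

A ratio well above 1 would suggest more repeated pairs than word frequencies alone produce; a ratio near or below 1 would suggest an assortment no more ordered than random. The input would be the word tokens of the chosen pages (labels left out), split at whatever word boundaries are being tested.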
We might increase the VMS "score" by assigning word delimiters based
on maximum "phrases". Using all two- and three-letter words certainly
would. At some arbitrarily assigned word-length that score will drop
drastically.
One of the many projects I started is a determination of the affinity
of letters independently of their frequencies in an attempt to find
allophones. A chart for the gallows is here:
@http://home.earthlink.net/~knoxmix/id6.html
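One common way to measure the affinity of adjacent letters independently of their frequencies is pointwise mutual information, sketched below. This is an assumed stand-in, not necessarily the calculation behind the chart linked above; the function name and the choice to count pairs only within words are illustrative.

```python
import math
from collections import Counter

def letter_affinity(words):
    """Return {(x, y): PMI in bits} for adjacent letters within words.

    PMI = log2( p(xy) / (p(x as first) * p(y as second)) ).
    Zero means the pair occurs as often as the letter frequencies
    predict; positive means attraction, negative means avoidance.
    """
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    total = sum(pairs.values())
    first = Counter()   # how often each letter starts a pair
    second = Counter()  # how often each letter ends a pair
    for (x, y), n in pairs.items():
        first[x] += n
        second[y] += n
    return {
        (x, y): math.log2(n * total / (first[x] * second[y]))
        for (x, y), n in pairs.items()
    }
```

The words can be strings in any transcription alphabet. Pairs that never occur are simply absent from the result, so avoidance shows up only for pairs seen at least once.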
A vague look at how some common words occur within the broad vicinity
of others is here:
@http://home.earthlink.net/~knoxmix/id19.html
which also shows what to me *looks like* a fairly natural
distribution of common words across the document. However, it is far
too rough to give much information.
A problem with resolving a bigram to a single letter is that,
assuming we have made a correct choice, we do not know what the
single letter might represent. Every variation I have attempted sent
statistics aft agley. Another problem is that there is an enormous
number of potential variations.
Ciao ...... Knox
On 12 Jul 2004 at 20:35, Eric wrote:
>
> Yes, that's essentially the idea behind a collocation
> - that the two items next to each other mean more than
> they do apart. "Tennis Court" means a very specific
> idea, separate from the words "tennis" and "court" on
> their own.
>
> Using a likelihood ratio to find collocations can be
> done on either letter-letter or word-word combinations
> (where you would take the spaces to indicate the word
> boundary). Problem is... just about everything in
> language is "on purpose" and statistically significant.
> So "and the" is significant, along with "tennis court"
> and just about everything else that appears in
> language.
>
> However, I haven't tried collocation likelihood
> measurements on gibberish or heavily encrypted text
> (by that I mean, a simple substitution cipher would
> behave the same as plain text for collocations). I
> would guess in these cases the number of significant
> collocations would drop, but by how significantly I
> don't know. I did run collocation likelihoods once on
> the VMS text (see my long message about concentrating
> on known languages) and didn't see any anomalies (the
> character combinations we always see together - 4o -
> show up).
>
> Minimum description length -- I just posted in my long
> message.
>
> If you have any ideas you would like to try out, let
> me know!
> ______________________________________________________________________
> To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
> unsubscribe vms-list