
Re: VMs: f85r2 "four ages" diagram (word boundaries)



Hello Eric and all,

I can only do the collocations by specifying a series of words as 
input. What I am looking for is a single score for the VMS. Does it 
contain "phrases" in numbers commensurate with a European language (I 
am sure it does not), about equal to a random assortment of words, or 
fewer than a random assortment? "Random" except that we keep the known 
frequency of occurrence of each word -- like having 20000 thoroughly 
mixed rocks but only 8000 different kinds of rocks. Labels should be 
omitted. The first six quires should be long enough for the test. 
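
A minimal sketch of such a score, in Python. The assumptions here are 
mine: a "phrase" is taken to be any adjacent word pair that occurs at 
least twice, the input is a plain list of transcribed words with the 
labels already removed, and the function names are only illustrative. 
The real text is compared against shuffles of the same words, so the 
word frequencies stay exactly as they are.

```python
import random
from collections import Counter

def repeated_bigrams(words, min_count=2):
    """Count distinct adjacent word pairs occurring at least min_count times."""
    pairs = Counter(zip(words, words[1:]))
    return sum(1 for c in pairs.values() if c >= min_count)

def phrase_score(words, trials=200, seed=0):
    """Compare the real text with shuffles that keep every word frequency."""
    rng = random.Random(seed)
    observed = repeated_bigrams(words)
    pool = list(words)  # same rocks, thoroughly mixed on each trial
    baseline = []
    for _ in range(trials):
        rng.shuffle(pool)
        baseline.append(repeated_bigrams(pool))
    mean = sum(baseline) / trials
    # Share of shuffles with at least as many repeated pairs as the real text.
    share = sum(1 for b in baseline if b >= observed) / trials
    return observed, mean, share
```

Calling phrase_score(text.split()) on a transcription returns the 
observed count, the average count over the shuffles, and the share of 
shuffles that did at least as well. An observed count near the average 
would suggest "about equal to a random assortment"; one well below it 
would suggest fewer.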

We might increase the VMS "score" by assigning word delimiters so as 
to maximize the number of "phrases". Using all two- and three-letter 
words certainly would. At some arbitrarily assigned word length that 
score will drop drastically. 

One of the many projects I started is a determination of the affinity 
of letters independently of their frequencies in an attempt to find 
allophones. A chart for the gallows is here:

@http://home.earthlink.net/~knoxmix/id6.html
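
One way to express affinity independently of frequency is to compare 
how often two letters sit side by side with how often they would if 
letters were placed independently. The sketch below is only an 
illustration of that idea, not necessarily how the chart above was 
made: it uses the log of observed over expected counts (pointwise 
mutual information), and it treats each transcription character as one 
letter, which is itself an assumption.

```python
import math
from collections import Counter

def letter_affinity(words):
    """Return log2(observed / expected) for each adjacent letter pair seen."""
    first = Counter()   # counts of a letter in the left position
    second = Counter()  # counts of a letter in the right position
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[a, b] += 1
            first[a] += 1
            second[b] += 1
    total = sum(pairs.values())
    # Expected count of (a, b) under independence: first[a] * second[b] / total
    return {
        (a, b): math.log2(n * total / (first[a] * second[b]))
        for (a, b), n in pairs.items()
    }
```

A value near zero means the pair occurs about as often as the letter 
frequencies alone predict; a positive value means the letters attract 
each other, a negative value that they avoid each other. Pairs that 
never occur are simply absent from the result.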

A vague look at how some common words occur within the broad vicinity 
of others is here:

@http://home.earthlink.net/~knoxmix/id19.html

which also shows what to me *looks like* a fairly natural 
distribution of common words across the document. However, it is far 
too rough to give much information. 

A problem with resolving a bigram to a single letter is that, even 
assuming we have made a correct choice, we do not know what the 
single letter might represent. Every variation I have attempted sent 
the statistics aft agley. Another problem is that there is an enormous 
number of potential variations. 

Ciao ...... Knox

 

On 12 Jul 2004 at 20:35, Eric wrote:
> 
> Yes, that's essentially the idea behind a collocation
> - that the two items next to each other mean more than
> they do apart. "Tennis Court" means a very specific
> idea, separate from the words "tennis" and "court" on
> their own.
> 
> Using a likelihood ratio to find collocations can be
> done on either letter-letter or word-word combinations
> (where you would take the spaces to indicate the word
> boundary). Problem is... just about everything in
> language is "on purpose" and statistically significant.
> So "and the" is significant, along with "tennis court"
> and just about everything else that appears in
> language.
> 
> However, I haven't tried collocation likelihood
> measurements on gibberish or heavily encrypted text
> (by that I mean, a simple substitution cipher would
> behave the same as plain text for collocations). I
> would guess in these cases the number of significant
> collocations would drop, but by how much I
> don't know. I did run collocation likelihoods once on
> the VMS text (see my long message about concentrating
> on known languages) and didn't see any anomalies (the
> character combinations we always see together - 4o -
> show up).
> 
> Minimum description length -- I just posted in my long
> message.
> 
> If you have any ideas you would like to try out, let
> me know!
> 
> ______________________________________________________________________
> To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
> unsubscribe vms-list

