
Re: VMs: Detecting "hands" automatically



Hi Bruce,

At 14:41 18/12/2003 -0500, Bruce Grant wrote:
> I have been reading about algorithms for grouping points into clusters with similar characteristics. I was curious whether an algorithm like this could detect the difference between A and B hands in the VMS based on the relative letter frequencies. After a test, it appears that it can do so pretty well.

Excellent! BTW, which were the (possibly anomalous) 7 "Hand B" pages which your algorithm thought were in Cluster 1? Any commonalities between these might point to a deeper pattern... :-)


I'd also be interested to know what would happen if you recursively passed it each set it emits, to form a binary tree. Even just the top levels of that tree (ie, the topmost sub-clusters within each of your first-pass Cluster 1 and Cluster 2) would be interesting. :-)
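To make the recursive idea concrete, here is a minimal sketch in Python. It is not your code: the function names, the farthest-point seeding and the fixed depth limit are all my own assumptions, and it uses a plain 2-means split on relative letter frequencies at each level.

```python
from collections import Counter

def letter_freqs(text, alphabet):
    """Relative frequency of each alphabet symbol in one page of text."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values()) or 1
    return [counts[ch] / total for ch in alphabet]

def dist2(a, b):
    """Squared Euclidean distance between two frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(vectors, iters=50):
    """Plain 2-means. Seeding (an assumption): the first vector and
    the vector farthest from it, so the result is deterministic."""
    c0 = vectors[0]
    c1 = max(vectors, key=lambda v: dist2(v, c0))
    labels = None
    for _ in range(iters):
        new = [0 if dist2(v, c0) <= dist2(v, c1) else 1 for v in vectors]
        if new == labels:
            break
        labels = new
        g0 = [v for v, lab in zip(vectors, labels) if lab == 0]
        g1 = [v for v, lab in zip(vectors, labels) if lab == 1]
        if not g0 or not g1:
            break
        c0, c1 = mean(g0), mean(g1)
    return labels

def bisect_tree(pages, alphabet, depth=2):
    """pages maps page id -> transcribed text. Returns a nested list:
    a leaf is a sorted list of page ids, a node is [left, right]."""
    ids = sorted(pages)
    if depth == 0 or len(ids) < 2:
        return ids
    labels = two_means([letter_freqs(pages[i], alphabet) for i in ids])
    left = {i: pages[i] for i, lab in zip(ids, labels) if lab == 0}
    right = {i: pages[i] for i, lab in zip(ids, labels) if lab == 1}
    if not left or not right:
        return ids  # the set would not split any further
    return [bisect_tree(left, alphabet, depth - 1),
            bisect_tree(right, alphabet, depth - 1)]
```

Calling bisect_tree(pages, alphabet, depth=2) would give the four second-level sub-clusters I was asking about. A real run would probably want a better stopping rule than a fixed depth, since 2-means will happily split a set that has no genuine structure in it.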

Finally (on my ever-expanding wish-list), as you've got the K-means process up and running it might also be revealing to apply it to a de-pairified transcription, where [for example] "qo", "dy", "ol" and "or" (and possibly "eo" as well?) are each converted into new tokens. My strong suspicion is that, because of the ubiquity of these pairs in the text, they comprised a "back-end coder", applied as a final stage - and that therefore many statistical tests might give more reliable results if applied to de-pairified text-streams (ie to a real alphabet and not to a fake alphabet).
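By "de-pairified" I mean something like the following Python sketch. The names PAIRS and depairify are made up for illustration, and the greedy left-to-right matching is an assumption on my part: it decides, for instance, that "qol" is read as "qo" + "l" rather than "q" + "ol", which may or may not be the right call.

```python
# Pairs to be treated as single tokens; "eo" is left out by default
# because I am less sure of it.
PAIRS = ("qo", "dy", "ol", "or")

def depairify(word, pairs=PAIRS):
    """Split a transcribed word into tokens, treating each listed pair
    as one token. Matching is greedy, left to right (an assumption)."""
    tokens = []
    i = 0
    while i < len(word):
        chunk = word[i:i + 2]
        if chunk in pairs:
            tokens.append(chunk)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

The frequency counts fed to the clustering would then be taken over these tokens instead of over single letters, with "eo" added via pairs=PAIRS + ("eo",) to see whether it makes a difference.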

Cheers, .....Nick Pelling.....

