
Re: VMs: Detecting "hands" automatically



Nick Pelling wrote:

Excellent! BTW, which were the (possibly anomalous) 7 "Hand B" pages which your algorithm thought were in Cluster 1? Any commonalities between these might point to a deeper pattern... :-)

It's the other way around: one cluster had 83 hand A pages, the other had 53 hand B pages and 7 hand A pages.


The misclassified hand A pages were: f30r, f38v, f42v, f51r, f54v, f100r, f100v
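
For anyone wanting to reproduce this kind of tally, here is a minimal Python sketch (not the Perl script used for the test). It cross-counts cluster assignments against known hand labels and lists the pages that ended up in a cluster dominated by the other hand. The function names and the toy page ids, labels and assignments are all hypothetical placeholders, not the real data.

```python
from collections import Counter

def tally(clusters, hands):
    """Count how many pages of each hand landed in each cluster."""
    return Counter(zip(clusters, hands))

def minority_pages(pages, clusters, hands):
    """Pages whose hand is not the majority hand of their cluster."""
    counts = tally(clusters, hands)
    majority = {}
    for (c, h), n in counts.items():
        # Keep the hand with the largest count seen so far for cluster c.
        if c not in majority or n > counts[(c, majority[c])]:
            majority[c] = h
    return [p for p, c, h in zip(pages, clusters, hands) if h != majority[c]]

# Toy placeholder data (NOT the real pages or results).
pages = ["p1", "p2", "p3", "p4", "p5", "p6"]
hands = ["A", "A", "A", "B", "B", "A"]
clusters = [0, 0, 0, 1, 1, 1]

print(dict(tally(clusters, hands)))
print(minority_pages(pages, clusters, hands))
```

With real data, `pages` would hold the folio names and `hands` the hand attribution for each folio.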

I'd also be interested to know what would happen if you recursively passed it each set it emits, to form a binary tree. Even the topmost results from the tree (ie, what are the topmost sub-clusters for each of your first-pass Cluster 1 and Cluster 2?) would be interesting too. :-)

I'm not sure what the early results would tell you since the algorithm starts with an essentially random guess at where the clusters are. I did notice that if, instead of using the first two points (both hand A pages) as the trial centroids, I chose a hand A point and a hand B point, the results were slightly better (5 misclassifications instead of 7).
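
To make the role of the starting centroids concrete, here is a rough Python sketch of two-cluster K-means in which the two trial centroids are picked by index, plus the recursive splitting asked about above. It is an illustration under assumptions, not the Perl script used for the test: the names `kmeans2` and `split_tree`, the squared Euclidean distance, and the two-dimensional toy points are all hypothetical (the per-page features actually used are not given here).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans2(points, init=(0, 1), max_iter=100):
    """Two-cluster K-means; `init` holds the indices of the two trial centroids."""
    centroids = [list(points[init[0]]), list(points[init[1]])]
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearer centroid.
        new_assign = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_assign == assign:  # nothing moved, so we have converged
            break
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def split_tree(points, ids, depth):
    """Recursively bisect a set of points into a nested tuple of id lists."""
    if depth == 0 or len(points) < 2:
        return ids
    assign = kmeans2(points)
    halves = [[i for i, a in enumerate(assign) if a == c] for c in (0, 1)]
    if not halves[0] or not halves[1]:
        return ids  # the split put everything on one side; stop here
    return tuple(
        split_tree([points[i] for i in h], [ids[i] for i in h], depth - 1)
        for h in halves
    )

# Toy placeholder data (NOT real page statistics).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ids = ["p0", "p1", "p2", "p3", "p4", "p5"]

print(kmeans2(points))             # first two points as trial centroids
print(kmeans2(points, init=(0, 3)))  # one trial centroid from each group
print(split_tree(points, ids, depth=1))
```

On well-separated toy data like this both starting choices end at the same split. On real page data the starting pair can change the result, as with the 5 versus 7 misclassifications mentioned above.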


Finally (on my ever-expanding wish-list), as you've got the K-means process up and running it might also be revealing to apply it to a de-pairified transcription, where [for example] "qo", "dy", "ol" and "or" (and possibly "eo" as well?) are each converted into new tokens. My strong suspicion is that, because of the ubiquity of these pairs in the text, these comprised a "back-end coder", applied as a final stage - and that therefore many statistical tests might give more reliable results if applied to de-pairified text-streams (ie to a real alphabet and not to a fake alphabet).
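
One possible reading of "de-pairifying" can be sketched in Python as a left-to-right scan that turns each listed pair into a single token and leaves every other letter as its own token. This is only an assumption about how the conversion might be done: the function name `depairify`, the greedy left-to-right rule for overlapping pairs, and the sample words are all hypothetical.

```python
from collections import Counter

# Pairs suggested above; "eo" is the optional one.
PAIRS = ("qo", "dy", "ol", "or", "eo")

def depairify(word, pairs=PAIRS):
    """Split a word into tokens, treating each listed pair as one token."""
    tokens = []
    i = 0
    while i < len(word):
        chunk = word[i:i + 2]
        if chunk in pairs:
            tokens.append(chunk)    # the pair becomes a single new token
            i += 2
        else:
            tokens.append(word[i])  # any other letter stays on its own
            i += 1
    return tokens

def token_counts(words, pairs=PAIRS):
    """Token frequencies over a list of words, e.g. as input to clustering."""
    counts = Counter()
    for w in words:
        counts.update(depairify(w, pairs))
    return counts

# Made-up sample words for illustration only.
print(depairify("qokedy"))
print(depairify("chol"))
print(token_counts(["qokedy", "chol", "qol"]))
```

Where pairs overlap (as in "eol"), this scan takes the leftmost pair first. A different precedence rule would give different tokens, so that choice would need to be settled before comparing statistics.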

This is a relatively simple algorithm to code: I used a short Perl script for the test. If anyone wants to look at the code I could post it, though it would require minor changes to use a different character set.


Bruce

