Re: VMs: Detecting "hands" automatically
Nick Pelling wrote:
> Excellent! BTW, which were the (possibly anomalous) 7 "Hand B" pages
> which your algorithm thought were in Cluster 1? Any commonalities
> between these might point to a deeper pattern... :-)
It's the other way around: one cluster had 83 hand A pages, the other
had 53 hand B pages and 7 hand A pages.
The misclassified hand A pages were: f30r, f38v, f42v, f51r, f54v,
f100r, f100v
> I'd also be interested to know what would happen if you recursively
> passed it each set it emits, to form a binary tree (a B-tree). Even
> the topmost results from the tree (ie, what are the topmost
> sub-clusters for each of your first-pass Cluster 1 and Cluster 2?)
> would be interesting too. :-)
I'm not sure what the early results would tell you since the algorithm
starts with an essentially random guess at where the clusters are. I did
notice that if, instead of using the first two points (both hand A
pages) as the trial centroids, I chose a hand A point and a hand B
point, the results were slightly better (5 misclassifications instead of 7).
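
For anyone who wants to try the effect of the starting guess, here is a
minimal sketch of the clustering loop in Python (not the Perl script I
used). It assumes each page has already been reduced to a numeric
feature vector, for example relative character frequencies; the name
kmeans and the seeds argument are made up for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, seeds, max_iter=100):
    """Plain k-means; `seeds` are the initial trial centroids."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy 2-D data standing in for per-page feature vectors (not real VMs data).
pages = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels_same, _ = kmeans(pages, seeds=pages[:2])              # first two points
labels_mixed, _ = kmeans(pages, seeds=[pages[0], pages[3]])  # one from each group
```

On well-separated toy data like this both seedings end up with the same
split; on real pages the choice of seeds can change the result, as the
7 versus 5 misclassifications show.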
> Finally (on my ever-expanding wish-list), as you've got the K-means
> process up and running it might also be revealing to apply it to a
> de-pairified transcription, where [for example] "qo", "dy", "ol" and
> "or" (and possibly "eo" as well?) are each converted into new tokens.
> My strong suspicion is that, because of the ubiquity of these pairs in
> the text, these comprised a "back-end coder", applied as a final stage
> - and that therefore many statistical tests might give more reliable
> results if applied to de-pairified text-streams (ie to a real alphabet
> and not to a fake alphabet).
K-means is a relatively simple algorithm to code - I used a short Perl
script for the test. If anyone wants to look at the code I could post
it, though it would need minor changes to handle a different character set.
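
As for the de-pairified transcription suggested above, the preprocessing
could look something like the sketch below. The replacement tokens
(upper-case letters, assuming a lower-case transcription) and the
left-to-right matching order are my assumptions, not something either of
us has settled on; "eo" is left out but could be added to the table.

```python
# Hypothetical pair-to-token table; the upper-case tokens are arbitrary.
PAIRS = {"qo": "Q", "dy": "D", "ol": "L", "or": "R"}


def depairify(text, pairs=PAIRS):
    """Replace each listed letter pair with a single token, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in pairs:
            out.append(pairs[pair])  # consume both letters of the pair
            i += 2
        else:
            out.append(text[i])      # copy any other character unchanged
            i += 1
    return "".join(out)


sample = depairify("qokedy chol")
```

Note that a sequence like "qol" is ambiguous (qo + l, or q + ol); this
sketch takes whichever pair starts first, so it becomes "Ql".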
Bruce