
VMs: Detecting "hands" automatically



I have been reading about algorithms for grouping points into clusters with similar characteristics. I was curious whether an algorithm like this could detect the difference between A and B hands in the VMS based on the relative letter frequencies. After a test, it appears that it can do so pretty well. Using a version of the interlinear VMS transcription, the algorithm I used (called "K-means") classified 145 pages identified as hand A or B as follows:

Cluster 1:  83 "hand A" pages, 7 "hand B" pages
Cluster 2:   0 "hand A" pages, 53 "hand B" pages

Using the D'Imperio transcription (A-Z, 0-9), I calculated the relative frequency of letters (i.e. count of each letter / total letters in the page) for each page, expressed as a 36-tuple of real numbers.
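The post does not include the code that was used, so the following is only a minimal Python sketch of the frequency calculation as described. The names ALPHABET and letter_frequencies are made up for illustration, and it assumes each page is available as a plain string in which every character outside A-Z and 0-9 (spaces, line breaks, markup) can simply be ignored.

```python
import string

# Assumed alphabet: the 36 symbols A-Z and 0-9 mentioned in the post.
ALPHABET = string.ascii_uppercase + string.digits


def letter_frequencies(page_text):
    """Return a 36-tuple: count of each symbol / total symbols on the page."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    counts = [0] * len(ALPHABET)
    for ch in page_text:
        i = index.get(ch)
        if i is not None:  # skip anything outside the 36-symbol alphabet
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        # Empty page: return all zeros rather than dividing by zero.
        return tuple(0.0 for _ in ALPHABET)
    return tuple(c / total for c in counts)
```

Each page then becomes one point in 36-dimensional space, and the entries of a non-empty page sum to 1.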

For a distance function to measure the closeness of the distributions of two pages, I used the ordinary Euclidean distance (square root of the sum of the squared differences).
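As a rough illustration of that distance (a sketch only; the function name euclidean_distance is hypothetical and not taken from the original test):

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared differences between two tuples."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For two pages this would be called on their two 36-tuples; a smaller value means more similar letter distributions.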

For the K-means algorithm, you start by choosing the number of clusters you are looking for (2 in this test) and choosing that many points as first guesses for the centers of the clusters. (Typically you just use the first N points in the list, as I did.)

Then, you repeatedly do the following steps, until cluster assignments don't change anymore:
1. Assign each point (page) to the cluster whose center it is nearest to.
2. For each cluster, re-estimate its center point as the average of all the points in the cluster.
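The two steps above can be sketched in Python roughly as follows. This is not the code used for the test; the function name kmeans, the max_iter safety cap, and the handling of empty clusters (the center is left where it is) are assumptions added for illustration. It takes the first k points as the starting centers, as described, and uses Euclidean distance.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (equal-length tuples of floats) into k groups.

    Returns (assignments, centers), where assignments[i] is the cluster
    index of points[i].
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # First guesses for the centers: the first k points in the list.
    centers = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):  # assumed cap so the loop always ends
        # Step 1: assign each point to the cluster with the nearest center.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no assignment changed, so we are done
        assignments = new_assignments
        # Step 2: move each center to the average of its cluster's points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignments, centers
```

With one frequency tuple per page, calling kmeans(page_vectors, 2) would give a cluster index for each page, which can then be tallied against the known hand A / hand B labels.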


The results were those shown above. This suggests that there are objective differences in the letter frequencies of the two hands, and that it might be interesting to look for similar relationships between clusters and other features, such as the topic of the page.

Bruce


