VMs: Detecting "hands" automatically
I have been reading about algorithms for grouping points into clusters
with similar characteristics. I was curious whether an algorithm like
this could detect the difference between A and B hands in the VMS based
on the relative letter frequencies. After a test, it appears that it can
do so pretty well. Using a version of the interlinear VMS transcription,
the algorithm I used (called "K-means") classified 145 pages identified
as hand A or B as follows:
Cluster 1: 83 "hand A" pages, 7 "hand B" pages
Cluster 2: 0 "hand A" pages, 53 "hand B" pages
Using the D'Imperio transcription (A-Z, 0-9), I calculated the relative
frequency of letters (i.e. count of each letter / total letters in the
page) for each page, expressed as a 36-tuple of real numbers.
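
As an illustration of the frequency calculation described above, here is
a minimal Python sketch. The function name letter_frequencies and the
choice to skip any character outside A-Z and 0-9 (spaces, line breaks,
transcription markup) are assumptions made for this example, not details
of the original test.

```python
import string

# Assumed 36-symbol alphabet: A-Z followed by 0-9.
ALPHABET = string.ascii_uppercase + string.digits
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def letter_frequencies(page_text):
    """Return a 36-tuple of relative letter frequencies for one page.

    Characters outside the assumed alphabet are ignored.
    """
    counts = [0] * len(ALPHABET)
    for ch in page_text:
        i = INDEX.get(ch)
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        # An empty page has no meaningful distribution; return all zeros.
        return tuple(0.0 for _ in counts)
    return tuple(c / total for c in counts)
```

Each page then becomes one point in a 36-dimensional space, and the
entries of the tuple sum to 1 for any page with at least one letter.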
For a distance function to measure how close the letter distributions
of two pages are, I used the ordinary Euclidean distance (square root of
the sum of the squared differences).
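
The distance just described can be sketched in a few lines of Python.
The function name euclidean_distance is only illustrative, and the
length check is an added safeguard rather than part of the original
test.

```python
import math


def euclidean_distance(p, q):
    """Square root of the sum of squared differences between two tuples."""
    if len(p) != len(q):
        raise ValueError("both points must have the same number of entries")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For two pages, p and q would be their 36-tuples of relative letter
frequencies.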
For the K-means algorithm, you start by choosing the number of clusters
you are looking for (2 in this test) and picking that many points as
first guesses for the centers of the clusters. (Typically you just use
the first N points in the list, as I did.)
Then, you repeatedly do the following steps, until cluster assignments
don't change anymore:
1. Assign each point (page) to the cluster whose center it is nearest to.
2. For each cluster, re-estimate its center point as the average of
all the points in the cluster.
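
The two steps above can be sketched in Python roughly as follows. This
is a minimal illustration, not the program used for the test: the
function name kmeans, the max_iter safety limit, and the rule that an
empty cluster keeps its previous center are all assumptions added here.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (equal-length tuples of floats) into k groups.

    Returns (assignments, centers), where assignments[i] is the cluster
    index of points[i].
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First guesses for the centers: the first k points in the list.
    centers = [tuple(p) for p in points[:k]]
    assignments = None

    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # Step 1: assign each point to the nearest center
        # (Euclidean distance; ties go to the lowest cluster index).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop once the cluster assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 2: re-estimate each center as the average of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )

    return assignments, centers
```

With one 36-tuple of letter frequencies per page and k = 2, the
returned assignments can be tallied against the known hand A and hand B
labels. Note that math.dist requires Python 3.8 or later.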
The results were those shown above. This suggests that there are
objective differences in the letter frequencies of the two hands, and
that it might be interesting to look for similar relationships between
clusters and other features, such as the topic of the page.
Bruce