VMs: First attempts at cluster analysis (not so encouraging)

I've been playing around with some text analysis tools I found on the web,
to see if they could find words with similar meaning. My first tests are not
really encouraging. I've been searching for word clusters in Lovecraft's "At
the Mountains of Madness", because it is a text I know and I could grab it
off the Internet easily.

There are a lot of variables in the method. Its success depends quite
heavily on the size of the corpus and the frequency of the words you put
into the algorithm. The size of the window within which word dependencies
are calculated plays a big part. Finally, the similarity coefficient of the
clustering algorithm is quite important. It took me quite a while to get any
result at all.
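
As a rough illustration of those variables, here is a minimal Python sketch.
It is not the tool used for the experiments below. It assumes word
dependencies are plain co-occurrence counts inside a fixed window, that the
similarity coefficient is cosine similarity, and that words are grouped by
single-link thresholding. All function names and the default window size are
hypothetical.

```python
from collections import Counter, defaultdict
import math
import re


def cooccurrence_vectors(text, targets, window=5):
    """Count the words seen within `window` tokens of each target word."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    targets = set(targets)
    vectors = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok not in targets:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[tok][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed coefficient)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())
                     * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def threshold_clusters(vectors, min_similarity):
    """Single-link grouping: two words share a cluster when a chain of
    pairs with similarity >= min_similarity connects them."""
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= min_similarity:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for w in words:
        clusters[find(w)].append(w)
    return sorted(clusters.values())
```

Calling cooccurrence_vectors() on the corpus with a word list, then
threshold_clusters() at several minimum similarity values, gives output of
the same general shape as the cluster lists below. With a small corpus or
rare words the count vectors get very sparse, which is one plausible reason
the results are so sensitive to corpus size and word frequency.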

Now if you look at experiments A and B (the best results I have so far),
you will see that the algorithm is not good at separating similar words from
a more-or-less random list (A). I can see no real logic in most of the
clusters it produces.
However, it does much better at separating two distinct classes, colors and
numbers (B). From similarity value 0,5 upward it produces one cluster with
numbers only (seven, two, zero, six).

This means that it's not useful for getting a clue about the meaning of VMS
"words". But if we can guess the possible class (star name, plant name,
number, quantity) of one VMS "word" we might use the method to find other
members of the class.
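
The "find other members of the class" idea could be sketched like this in
Python: take one seed word whose class we think we know, and rank candidate
words by how similar their contexts are to the seed's. This is only an
illustration under assumptions: co-occurrence counts in a fixed window,
cosine similarity, and a simple letter-based tokenizer that would have to be
replaced for a VMS transcription. The function names are hypothetical.

```python
from collections import Counter
import math
import re


def context_vector(tokens, word, window=5):
    """Count the tokens seen within `window` positions of `word`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed coefficient)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())
                     * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_seed(text, seed, candidates, window=5):
    """Return candidates ordered from most to least similar to the seed."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    seed_vec = context_vector(tokens, seed, window)
    scored = [
        (cosine(seed_vec, context_vector(tokens, c, window)), c)
        for c in candidates
        if c != seed
    ]
    # Highest similarity first; ties broken alphabetically.
    return [c for _, c in sorted(scored, key=lambda sc: (-sc[0], sc[1]))]
```

The top of the ranking would then be the candidate class members to inspect
by hand; given how experiment A turned out, the ranking alone should not be
trusted.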

This is all on the assumption that VMS "words" are really words, which is
not at all certain.

I'll try to tune the methods some more. Maybe they can get better than this;
at least the articles I posted would suggest so.

Greetings, Petr

*** Experiment A:

This is a test of some medium-frequency words in the text. The result is not
encouraging:

Word clusters for minimum similarity value : 0,3
Cluster 2 :  thousand, snow, too, years, when, mountains, range, two, walls,
thought, then, will, things, these, saw, wind, only.
Cluster 7 :  sea, indeed.
Cluster 12 :  thing, yet, time, since, well, unknown, vast, new, through,
three, specimens, others, where, plane, might.
Cluster 20 :  such, them, whose.
Cluster 45 :  much, most.

Word clusters for minimum similarity value : 0,5
Cluster 2 :  thousand, snow, too, years, when, mountains, range.
Cluster 3 :  two, walls, thought, then, will.
Cluster 6 :  saw, wind.
Cluster 19 : thing, yet, time, since, well, unknown, vast, new, through,
three.
Cluster 29 :  very, off.
Cluster 33 :  them, whose.

Word clusters for minimum similarity value : 0,7
Cluster 2 :  thousand, snow, too, years, when.
Cluster 5 :  two, walls, thought.
Cluster 24 :  thing, yet, time, since, well, unknown, vast, new, through.

Word clusters for minimum similarity value : 0,9
Cluster 2 :  thousand, snow.
Cluster 3 :  too, years.
Cluster 7 :  two, walls.
Cluster 27 :  thing, yet, time.
Cluster 29 :  well, unknown, vast, new.

*** Experiment B:

Word clusters for minimum similarity value : 0,3
Cluster 1 :  black.
Cluster 2 :  white, red, three, seven, two, zero, six, nine.
Cluster 3 :  eight.
Cluster 4 :  five.
Cluster 5 :  one.
Cluster 6 :  four.
Cluster 7 :  green.

Word clusters for minimum similarity value : 0,5
Cluster 1 : black.
Cluster 2 : white.
Cluster 3 : red, three.
Cluster 4 : seven, two, zero, six.
Cluster 5 : nine.
Cluster 6 : eight.
Cluster 7 : five.
Cluster 8 : one.

Word clusters for minimum similarity value : 0,7
Cluster 1 : black.
Cluster 2 : white.
Cluster 3 : red.
Cluster 4 : three.
Cluster 5 : seven, two, zero, six.
Cluster 6 : nine.
Cluster 7 : eight.
Cluster 8 : five.
Cluster 9 : one.
Cluster 10 : four.
Cluster 11 : green.

Word clusters for minimum similarity value : 0,9
Cluster 1 : black.
Cluster 2 : white.
Cluster 3 : red.
Cluster 4 : three.
Cluster 5 : seven, two, zero, six.
Cluster 6 : nine.
Cluster 7 : eight.
Cluster 8 : five.
Cluster 9 : one.
Cluster 10 : four.
Cluster 11 : green.
