
VMs: Text clustering approaches (finding synonyms & paragraphs in text)



> So, the people do not care very much, is it
> written "olodabas", "oladbas", "olodobos", maybe more. And the poor
> scientist 500 years later believes, these all are different words...

This might be tested by using the methods described in:

Unsupervised discovery of morphologically related words based on
orthographic and semantic similarity
http://www.cogsci.ed.ac.uk/sigphon/papers/BaroniMatiasekTrost02.pdf

Finding Semantically Related Words in Large Corpora
http://nlp.fi.muni.cz/publications/tsd2001_smrz_pary/tsd2001_smrz_pary.pdf

However, I tried this approach on Lovecraft's "At the Mountains of Madness"
and didn't have much success with it. I was able to separate two classes of
words (colors versus numbers), but I wasn't able to make useful clusters out
of a random sample of words. So my first guess is that the VMS is too small
for statistical clustering algorithms to work. But I may be wrong.
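
To make the idea concrete, here is a rough sketch of only the
orthographic half of such a test. It is not the method from either paper:
it uses plain Levenshtein edit distance and single-link grouping, and the
threshold max_dist=2 and the extra word "daiin" are my own assumptions,
chosen just for illustration. Without a semantic (co-occurrence) check it
will happily merge unrelated words that merely look alike.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def group_variants(words, max_dist=2):
    """Single-link grouping: words within max_dist edits end up together."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())

# The three spellings from the quote, plus one unrelated word (assumed).
words = ["olodabas", "oladbas", "olodobos", "daiin"]
print(group_variants(words))
```

On a real transcription the threshold would need tuning against word
length, since two edits mean much more in a four-letter word than in an
eight-letter one.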

+ new idea: text clustering

While looking for the above article I found this interesting approach. I'll
have to think about it, it's not easy stuff. But it might help us to find
sentences or paragraphs in the VMS, if they exist:

Detecting Subject Boundaries Within Text: A Language
Independent Statistical Approach
http://acl.ldc.upenn.edu/W/W97/W97-0305.pdf

MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT
http://www.sims.berkeley.edu/~hearst/papers/tiling-acl94/acl94.html
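
As far as I understand the second paper, the core idea is to compare the
vocabulary just before and just after each gap between words and to look
for places where the similarity drops. Below is a much simplified sketch
of that idea, not the published algorithm: it skips the smoothing and
depth scoring, the block size of 4 tokens is an arbitrary assumption, and
the function names are my own.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two token-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def gap_similarities(tokens, block=4):
    """For each gap, compare the `block` tokens before it with those after."""
    sims = []
    for gap in range(block, len(tokens) - block + 1):
        left = Counter(tokens[gap - block:gap])
        right = Counter(tokens[gap:gap + block])
        sims.append((gap, cosine(left, right)))
    return sims

def candidate_boundaries(sims):
    """Gaps whose similarity is a strict local minimum."""
    found = []
    for i in range(1, len(sims) - 1):
        if sims[i][1] < sims[i - 1][1] and sims[i][1] < sims[i + 1][1]:
            found.append(sims[i][0])
    return found

# Toy input: the vocabulary changes completely after the eighth token.
tokens = ["a", "b"] * 4 + ["x", "y"] * 4
print(candidate_boundaries(gap_similarities(tokens)))
```

On real text every small dip would show up as a candidate, so some
smoothing or a minimum depth for the valley would be needed before the
output means anything.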
