VMs: Text clustering approaches (finding synonyms & paragraphs in text)
> So, the people do not care very much, is it
> written "olodabas", "oladbas", "olodobos", maybe more. And the poor
> scientist 500 years later believes, these all are different words...
This might be tested by using the methods described in:
Unsupervised discovery of morphologically related words based on
orthographic and semantic similarity
http://www.cogsci.ed.ac.uk/sigphon/papers/BaroniMatiasekTrost02.pdf
Finding Semantically Related Words in Large Corpora
http://nlp.fi.muni.cz/publications/tsd2001_smrz_pary/tsd2001_smrz_pary.pdf
However, I tried this approach on Lovecraft's "At the Mountains of Madness"
and didn't have much success with it. I was able to separate two classes of
words (colors versus numbers), but I wasn't able to make useful clusters out
of a random sample of words. So my first guess is that the VMS is too small
to apply statistical clustering algorithms. But I may be wrong.
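
To make the spelling-variant question concrete, here is a minimal sketch of
the orthographic half of the idea only. It is not the method from either
paper above (both also use semantic/contextual similarity); it just groups
words whose edit distance is small. The function names, the threshold of 2
and the extra word "qokeedy" are my own assumptions for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_variants(words, max_dist=2):
    """Single-link grouping: two words share a group if a chain of pairs,
    each within max_dist edits, connects them. max_dist is an assumed,
    untuned threshold."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    sample = ["olodabas", "oladbas", "olodobos", "qokeedy"]
    for group in group_variants(sample):
        print(group)
```

On a short text a threshold like this will also merge words that are
genuinely different, so it can only suggest candidates, not prove that two
spellings are the same word.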
+ new idea: text clustering
While looking for the above article I found this interesting approach. I'll
have to think about it, it's not easy stuff. But it might help us to find
sentences or paragraphs in the VMS, if they exist:
Detecting Subject Boundaries Within Text: A Language
Independent Statistical Approach
http://acl.ldc.upenn.edu/W/W97/W97-0305.pdf
MULTI-PARAGRAPH SEGMENTATION OF EXPOSITORY TEXT
http://www.sims.berkeley.edu/~hearst/papers/tiling-acl94/acl94.html
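
As a rough illustration of what such segmentation does, here is a heavily
simplified sketch in the spirit of the block-comparison idea: compare the
word counts in a window before and after each gap, and treat gaps where the
similarity dips as candidate boundaries. This is not the algorithm from
either paper as published (no smoothing, no depth scores, no stop-word
handling); the function names and the block size are my own assumptions,
and the token list would have to come from a transcription of the text.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def gap_scores(tokens, block=20):
    """For each gap at a multiple of `block`, score the similarity of the
    `block` tokens before the gap against the `block` tokens after it."""
    scores = []
    for gap in range(block, len(tokens) - block + 1, block):
        left = Counter(tokens[gap - block:gap])
        right = Counter(tokens[gap:gap + block])
        scores.append((gap, cosine(left, right)))
    return scores


def boundaries(tokens, block=20):
    """Return gap positions whose score is a strict local minimum, i.e.
    lower than the scores of both neighbouring gaps."""
    scores = gap_scores(tokens, block)
    found = []
    for i in range(1, len(scores) - 1):
        here = scores[i][1]
        if here < scores[i - 1][1] and here < scores[i + 1][1]:
            found.append(scores[i][0])
    return found


if __name__ == "__main__":
    # Toy input: 40 tokens from one vocabulary followed by 40 from another.
    toy = ["a", "b"] * 20 + ["x", "y"] * 20
    print(boundaries(toy, block=10))
```

The toy input has an artificially sharp vocabulary change; on real text the
dips are much shallower, so the block size and the rule for picking minima
matter a lot.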