
Re: VMs: word length counts & Computational Linguistics

It took me a while to get round to replying to this one, as I had to check a couple of things. On suffixes, there is a bit of information in the tables at the back of D'Imperio, but I am not sure it includes any statistics. However ...
I read an interesting article in the journal Computational Linguistics from a few years back (I'll dig out the reference). In it, John Goldsmith describes a method for computer analysis of an unknown (but Indo-European?) language that generates a morphological analysis, i.e. grammar, suffixes, etc. He has implemented this as a Windows program called Linguistica 2001 that can be downloaded here:
and there is a pdf of his paper there as well. (Get a copy of "Easy PDF Converter" to convert to .txt :) )
I have extracted the abstract, which follows:
This study reports the results of using minimum description length (MDL) analysis to model
unsupervised learning of the morphological segmentation of European languages, using corpora
ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly
develop a probabilistic morphological grammar, and use MDL as our primary tool to determine
whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar
matches well the analysis that would be developed by a human morphologist.
In the final section, we discuss the relationship of this style of MDL grammatical analysis to
the notion of evaluation metric in early generative grammar.
*** end of abstract
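
To give a feel for the MDL idea in the abstract, here is a toy sketch of my own, not Goldsmith's algorithm and not what Linguistica does internally. It assumes a very crude cost: one unit per character stored, one unit per entry separator, and one unit per pointer from a word to its stem or suffix. The function names and the English word list are made up for illustration. A proposed split into stems and suffixes is "adopted" only if it makes the total description shorter than listing every word whole.

```python
def lexicon_cost(items):
    # Crude proxy for description length: one unit per character,
    # plus one unit per entry as a separator.
    return sum(len(item) + 1 for item in items)


def split_cost(words, suffixes):
    # Cost of describing the words as stem + suffix pairs.
    stems = set()
    used_suffixes = set()
    for w in words:
        # Pick the longest candidate suffix that leaves a stem
        # of at least two characters; "" means "no suffix".
        best = ""
        for s in suffixes:
            if w.endswith(s) and len(w) - len(s) >= 2 and len(s) > len(best):
                best = s
        stems.add(w[:len(w) - len(best)])
        used_suffixes.add(best)
    # Each word is then two pointers: one to a stem, one to a suffix.
    return lexicon_cost(stems) + lexicon_cost(used_suffixes) + 2 * len(words)


# Made-up English sample, purely for illustration.
words = ["walk", "walks", "walked", "walking",
         "jump", "jumps", "jumped", "jumping"]

whole = lexicon_cost(set(words))
split = split_cost(words, ["s", "ed", "ing"])
print("whole-word cost:", whole)
print("stem+suffix cost:", split)
print("adopt the split" if split < whole else "keep whole words")
```

The real method works with probabilities and bit counts rather than character counts, so treat this only as an illustration of the accept-or-reject test.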

Mart Vabar <mesinik@xxxxxx> wrote:

On Fri, 4 Jul 2003, GC wrote:

> 96 pages
> 31,412 glyphs or characters
> 8,175 words
> 2,940 unique words

How much does it change if we cut a character or a pair from longer words?
Has anybody counted how many suffixes the VMS has?
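
The trimming question above is easy to try on any transcription word list. A minimal sketch, assuming the words are already split into plain strings; the function names, the `min_len` threshold of 5 for "longer words", and the sample tokens are placeholders of mine, not real counts from the manuscript:

```python
from collections import Counter


def unique_after_trim(words, cut=1, min_len=5):
    # Drop the last `cut` characters from words of at least `min_len`
    # characters, then count how many distinct forms remain.
    trimmed = [w[:-cut] if cut > 0 and len(w) >= min_len else w
               for w in words]
    return len(set(trimmed))


def ending_counts(words, n=2, min_len=5):
    # Tally the final `n` characters of each distinct longer word,
    # as a rough list of suffix candidates.
    return Counter(w[-n:] for w in set(words) if len(w) >= min_len)


# Placeholder tokens for illustration only; substitute a real word list.
sample = ["qokedy", "qokeey", "chedy", "chedo", "daiin", "daiir", "chol"]

for cut in (0, 1, 2):
    print("cut", cut, "->", unique_after_trim(sample, cut=cut), "unique words")
print(ending_counts(sample).most_common(3))
```

If the unique-word count drops sharply when one or two final characters are removed, that would suggest many words differ only in their endings, which is what a suffix system would produce.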
