
VMs: word database and binomial distribution

First, for all of you text analysis people out there,
I found this database of English words (in case it is
not already known):


There is an online interface to the database and you
can download the whole database and search
application. It is very extensive and includes useful
items like # of syllables, # of phonemes and phonemic
syllabic transcription.

I found it while playing around with ideas for
investigating syllabic and phonemic properties of
English text, especially with regard to what Stolfi
reported years ago about the VMS word length:


I was wondering whether that might be syllabic or
phonemic in structure (all VMS phonemes written as two
characters, or all syllables being two characters, for
instance). While plotting those, I also plotted word
length and... I got a binomial plot for it??? I used
the summary data from this page:


There are some anomalies in the database's data
(search for all one-letter words, for instance), but
they don't look systematic. The plot is much sharper
than what Stolfi shows for "English" and "Latin", maybe
because the database has a much larger sample of
words. My first reaction is to take that to mean the
binomial nature of the VMS word lengths pointed out by
Stolfi is unusual only in that it shows up in a small
data sample.
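
For anyone who wants to repeat the comparison on their own word
list, here is a rough Python sketch of one way to do it. It is not
the procedure I used for the plot, and it is not Stolfi's either.
The function names, the shift by one (length = 1 + number of
"successes") and the choice of p by matching the mean are all my
own assumptions for illustration.

```python
from collections import Counter
from math import comb


def length_distribution(words):
    """Relative frequency of each word length, counted over distinct words."""
    counts = Counter(len(w) for w in set(words))
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


def binomial_fit(dist, n):
    """Binomial(n, p) over lengths 1..n+1, with p chosen to match the mean.

    Assumption: length = 1 + (number of successes in n trials), so n
    should be at least (longest word length - 1).
    """
    mean = sum(k * f for k, f in dist.items())
    p = (mean - 1) / n
    if not 0.0 <= p <= 1.0:
        raise ValueError("n is too small for the observed mean length")
    return {
        k: comb(n, k - 1) * p ** (k - 1) * (1 - p) ** (n - k + 1)
        for k in range(1, n + 2)
    }


def total_variation(observed, expected):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    keys = set(observed) | set(expected)
    return 0.5 * sum(
        abs(observed.get(k, 0.0) - expected.get(k, 0.0)) for k in keys
    )


if __name__ == "__main__":
    # Tiny made-up sample, only to show the calls; use a real word list.
    sample = "the quick brown fox jumps over a lazy dog again".split()
    observed = length_distribution(sample)
    n = max(observed) - 1
    fitted = binomial_fit(observed, n)
    print(observed)
    print(fitted)
    print(total_variation(observed, fitted))
```

Counting distinct words (set(words)) compares the lexicon rather
than running text. Dropping the set() gives the token distribution
instead, which can look quite different because short words are
used so often.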

