VMs: Rene: Stat questions
Rene,
In regard to your page http://www.voynich.nu/extra/lang.html, a question.
Your page says:
The correlation between two pages was defined as the number of words common
to both pages. If any word occurred several times on one page, each
occurrence was counted. The following example may explain this more clearly:
Page 1: Ape Ape Bear Cat Cat Cat
Page 2: Ape Ape Ape Boar Cat
The number of common words is three: two times Ape and one Cat. Obviously,
the number of common words depends heavily on the number of words on each
page. Since the number of words per page is highly variable (and correlated
with the language used, B pages being much more verbose), a normalisation
factor had to be used. This factor was chosen as a constant divided by the
square root of the product of the number of words on the two pages being
compared. This is not a perfect method, and suggestions for finding a better
'rule' would be appreciated.
I have some interest in the last part especially:
"This factor was chosen as a constant divided by the square root of the
product of the number of words on the two pages being compared."
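To make sure I am reading the rule correctly, here is a minimal sketch of the measure as I understand it from the quoted text. The function names are my own, and the per-word minimum count is my assumption, chosen because it reproduces the Ape/Cat example. The page does not state the constant, so 1.0 is only a placeholder.

```python
from collections import Counter
from math import sqrt


def common_words(page_a, page_b):
    """Count words common to both pages, counting repeated occurrences."""
    counts_a = Counter(page_a.split())
    counts_b = Counter(page_b.split())
    # Counter & Counter keeps, for each shared word, the smaller of the
    # two counts (assumed reading of "each occurrence was counted").
    return sum((counts_a & counts_b).values())


def normalised_correlation(page_a, page_b, constant=1.0):
    """Common-word count scaled by constant / sqrt(n_a * n_b)."""
    n_a = len(page_a.split())
    n_b = len(page_b.split())
    if n_a == 0 or n_b == 0:
        return 0.0  # an empty page shares no words with anything
    return constant * common_words(page_a, page_b) / sqrt(n_a * n_b)


page_1 = "Ape Ape Bear Cat Cat Cat"
page_2 = "Ape Ape Ape Boar Cat"

print(common_words(page_1, page_2))            # 3
print(normalised_correlation(page_1, page_2))  # 3 / sqrt(6 * 5)
```

On the example pages this gives three common words, divided by the square root of 6 times 5.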
Are you speaking of a chi-squared standardization factor? And why use
something like this, when you have a verifiable count of pages and words in
each "language"? What is the downside of using b-total/a-total as the
standardization factor, and then moderating that against the page variant
stats? That is: "A-page is a percentage of all A pages/B-page is a percentage
of all B pages, standardized by the percentage of b/a pages." Why wouldn't
this be the most representative of any page statistic? I ask because I'm
facing the identical statistical quandary and seeking to express myself in
the best light (light that hides the hunched back and mole on the nose).
GC