
Re: VMs: Rene: Stat questions



--- GC <glenclaston@xxxxxxxxxxx> wrote:

> Rene,
> 
> In regard to your page
> http://www.voynich.nu/extra/lang.html, a question.
> 
> Your page says:
> 
>> The correlation between two pages was defined
>> as the number of words common to both pages.  
>> [...] Obviously, the number of common words
>> depends heavily on the number of words on each
>> page. [...]  a normalisation factor had to be
>> used. This factor was chosen as a constant
>> divided by the square root of the product of
>> the number of words on the two pages being
>> compared. 

> Are you speaking of a chi2 standardization factor?
> And why use something like this, when you have a
> verifiable count of pages and words in each
> "language"?  What is the downside of using a
> b-total/a-total as standardization factor, and then
> moderating that against the page variant stats?
> "A-page is a percentage of all A pages/B-page is a
> percentage of all B pages, standardized by the
> percentage of b/a pages"?

Do you mean B/A as in Currier's languages? First
of all I wanted to check the validity of this
identification of two distinct languages, so there
is no real preconception about each page. The
variability of the length is a bit of a nuisance.
I wanted to allow the possibility that there were
also C and D languages, or transition languages.

Then, I wanted to obtain one number for each pair
of pages (the 'correlation' between the two pages),
not a pair of numbers for the pair of pages (which
could for example be the two fractions as you
suggest). If one calculates the two fractions,
one gets two different numbers if the page lengths
are not the same. In order to make this one
number inter-comparable across all pairs of pages,
it should be made independent of the lengths of
the pages.
My problem was when comparing a short page with
a long page.

Also, if you have one reference page and want
to compare it with all other pages, the fraction
of words on your reference page common with all
other pages depends on the length of these other
pages.
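
To make the difference between the two fractions and the single
normalised number concrete, here is a minimal sketch in Python. It is
not the code used for the web page. It assumes that each page is
reduced to its set of distinct words (whether word tokens or distinct
words are meant is not stated above), and the function names, the
default constant of 1 and the two example pages are invented for
illustration only.

```python
from math import sqrt


def common_words(page_a, page_b):
    """Number of distinct words that occur on both pages."""
    return len(set(page_a) & set(page_b))


def shared_fractions(page_a, page_b):
    """The two fractions: shared words relative to each page's own size.

    These differ whenever the two pages have different sizes.
    Assumes both pages are non-empty.
    """
    c = common_words(page_a, page_b)
    return c / len(set(page_a)), c / len(set(page_b))


def normalised_correlation(page_a, page_b, constant=1.0):
    """One symmetric number per pair of pages.

    Common-word count times constant / sqrt(n_a * n_b), where n_a and
    n_b are the numbers of distinct words on each page (an assumption).
    """
    n_a, n_b = len(set(page_a)), len(set(page_b))
    if n_a == 0 or n_b == 0:
        return 0.0
    return constant * common_words(page_a, page_b) / sqrt(n_a * n_b)


# Made-up pages: a short one (3 words) fully contained in a long one (12 words).
short_page = ["daiin", "chedy", "qokeedy"]
long_page = ["daiin", "chedy", "ol", "shedy", "qokain", "chol",
             "chor", "okaiin", "dar", "aiin", "qokeedy", "shey"]

print(shared_fractions(short_page, long_page))        # (1.0, 0.25)
print(normalised_correlation(short_page, long_page))  # 0.5
```

The two fractions disagree strongly (1.0 against 0.25) because of the
difference in page size, while the normalised value is the same
whichever page is taken as the reference.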

Anyway, I was not completely happy with it, and
if you read the follow-on article, you will see a
completely different 'correlation' statistic,
which I preferred since it is less dependent on
incidental spelling or transcription errors.

In the end, in my opinion there are no clear
C and D languages but there are transition
areas. This makes it less likely (to me)
that the VMs is from the hand of more than
one person. The observed drift could be the
result of a multitude of causes, linguistic
or cryptographic, or even combined.
While I can't prove it, the one thing this drift
has convinced me of is that it identifies (roughly)
the order in which the text on the pages was
composed.

Cheers, Rene
