RE: VMs: Rene: Stat questions
Rene,
Thanks for the info. It's much clearer. I'm asking some slightly different
questions, but I think I'm also duplicating some of what you've already
done, so I'm taking some time to digest your studies and see how they apply
to what I'm seeing.
You've mentioned the transitions that I've also seen, and I think this is a
very important thing to make note of, as there is much more information to
be gleaned from this.
I've taken out all words from the herbal section that are shared between
[ha] and [hb], and looked at the transition of words and similar word forms
for both of these. I need to come up with causes for some of what I'm
seeing, but a general overview might help to gain some ideas from you and
others on ways of looking at this that might be useful.
I have doubts that the [hb] bifolios are out of place, since they seem to
follow a certain pattern. They are not the same as [ha], but they follow a
pattern associated with [ha], in my view. [hb] pages have 26.2% of their
combined words as unique words, occurring only once in the herbal section.
[ha] has about 2.6 times as many words (7,036 vs. 2,728), but also has
24.5% of its total words as unique words. Even with such a large difference
in total word occurrences, [ha] shares 50.5% of its words with [hb], while
[hb] shares 54.6% of its total words with [ha]. These are far too close, so
I've been trying to find some other differences besides those that
differentiate [ha/hb].
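
For anyone wanting to reproduce figures of this kind, here is a minimal sketch in Python. It assumes each group is simply a list of word tokens, that "unique" means a word occurring exactly once across both groups combined, and that "shared" is counted per token rather than per distinct word form. The function name and the toy tokens are illustrative only, not the actual transcription data or the tool used for the numbers above.

```python
from collections import Counter

def section_stats(words_a, words_b):
    """Return (stats_a, stats_b): hapax share and shared-token share per group.

    Assumptions (not from the original post): a word is 'unique' if it
    occurs exactly once in the two groups combined, and shares are
    computed over tokens, not over distinct word forms.
    """
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    combined = counts_a + counts_b

    def stats(counts, other):
        total = sum(counts.values())
        # tokens whose word form occurs only once in the whole section
        hapax = sum(n for w, n in counts.items() if combined[w] == 1)
        # tokens whose word form also appears in the other group
        shared = sum(n for w, n in counts.items() if w in other)
        return {
            "total": total,
            "hapax_share": hapax / total if total else 0.0,
            "shared_share": shared / total if total else 0.0,
        }

    return stats(counts_a, counts_b), stats(counts_b, counts_a)

# Toy tokens only, to show the call shape.
ha_stats, hb_stats = section_stats(
    ["daiin", "chol", "chol", "shol"],
    ["daiin", "qokedy", "chedy", "daiin"],
)
print(ha_stats)
print(hb_stats)
```

Counting per distinct word form instead of per token would give different percentages, so it is worth stating which convention a given figure uses.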
This is in line with your page comparisons, but I'm working my way down the
groups, so please bear with me.
[ha] is not strictly [ha], in my view. Each quire has some big differences,
but there are more similarities between q1 (quire 1) and q3 than with any
other [ha] quire. q2 appears to be devoid of many of the things that join
q1 and q3, but q2 has much in common with q6.
Looking at quires is important to me at the moment because of the inclusion
of [hb] pages in q4 and beyond. q4 and q5 have many things in common, in
both [ha] and [hb]. What gets interesting is that the [hb] pages inserted
before q6 often have little in common with the [hb] sections of q6 and q7.
In many cases only a single page, 34v, shares words from q6 and q7, and
these words do not start with 41r, but with 41v and beyond in the [hb]
pages. Meanwhile, 34r shares much in common with 41r, the folio that seems
to be missing from the other grouping.
I need to find a way to go a little deeper than this. Your observation that
[hb] is about half [ha] is generally proven, but there are folios that are
far more one than the other, and within folios there are paragraphs that are
almost entirely [ha] or [hb], without the usual mix. Within paragraphs
there are entire lines that are either [ha] or [hb], and even more bizarre,
there are [ha] folios before [hb] was introduced that have lines with words
that are almost wholly found in [hb] pages.
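
One possible way to make the line-by-line picture concrete is sketched below in Python. It assumes two vocabularies have already been built, one from all [ha] pages and one from all [hb] pages, and it reports for a single line what fraction of its words occur only in [ha], only in [hb], in both, or in neither. The function name, the vocabularies and the tokens are hypothetical, chosen only for illustration.

```python
def line_profile(line_words, ha_vocab, hb_vocab):
    """Fraction of a line's words found only in [ha], only in [hb], in both, or in neither.

    ha_vocab and hb_vocab are sets of word forms seen on [ha] and [hb]
    pages respectively (assumed to be built beforehand).
    """
    n = len(line_words)
    tally = {"only_ha": 0, "only_hb": 0, "both": 0, "neither": 0}
    for w in line_words:
        in_ha, in_hb = w in ha_vocab, w in hb_vocab
        if in_ha and in_hb:
            tally["both"] += 1
        elif in_ha:
            tally["only_ha"] += 1
        elif in_hb:
            tally["only_hb"] += 1
        else:
            tally["neither"] += 1
    # convert counts to fractions; an empty line yields all zeros
    return {k: (v / n if n else 0.0) for k, v in tally.items()}

# Toy vocabularies and line, for illustration only.
ha_vocab = {"chol", "shol", "daiin"}
hb_vocab = {"qokedy", "chedy", "daiin"}
print(line_profile(["qokedy", "chedy", "daiin", "chol"], ha_vocab, hb_vocab))
```

If the line being profiled is itself part of the pages used to build a vocabulary, it should probably be left out when building that vocabulary, otherwise every word on it trivially counts as belonging to its own group.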
There's a general pattern and evidence of transition. Some of the
differences between [ha] and [hb] folios are simply in the word ending,
especially in longer words, and there are [ha] paragraphs and lines in [hb]
pages (and vice versa). I need to take this all much deeper - I just need
to be asking the right question. Any ideas?
GC
> -----Original Message-----
> From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On
> Behalf Of Rene Zandbergen
> Sent: Tuesday, July 22, 2003 2:54 AM
> To: vms-list@xxxxxxxxxxx
> Subject: Re: VMs: Rene: Stat questions
>
>
>
> --- GC <glenclaston@xxxxxxxxxxx> wrote:
>
> > Rene,
> >
> > In regard to your page
> > http://www.voynich.nu/extra/lang.html, a question.
> >
> > Your page says:
> >
> >> The correlation between two pages was defined
> >> as the number of words common to both pages.
> >> [...] Obviously, the number of common words
> >> depends heavily on the number of words on each
> >> page. [...] a normalisation factor had to be
> >> used. This factor was chosen as a constant
> >> divided by the square root of the product of
> >> the number of words on the two pages being
> >> compared.
>
> > Are you speaking of a chi2 standardization factor?
> > And why use something
> > like this, when you have a verifiable count of pages
> > and words in each
> > "language"? What is the downside of using a
> > b-total/a-total as
> > standardization factor, and then moderating that
> > against the page variant
> > stats? "A-page is a percentage of all A
> > pages/B-page is a percentage of all
> > B pages, standardized by the percentage of b/a
> > pages"?
>
> Do you mean B/A as in Currier's language? First
> of all I wanted to check the validity of this
> identification of two distinct languages, so there
> is no real preconception about each page. The
> variability of the length is a bit of a nuisance.
> I wanted to allow the possibility that there were
> also C and D languages, or transition languages.
>
> Then, I wanted to obtain one number for each pair
> of pages (the 'correlation' between the two pages),
> not a pair of numbers for the pair of pages (which
> could for example be the two fractions as you
> suggest). If one calculates the two fractions,
> one gets two different numbers if the page lengths
> are not the same. In order to make this one
> number inter-comparable for each pair of pages,
> it should be made independent of the lengths of
> the pages.
> My problem was when comparing a short page with
> a long page.
>
> Also, if you have one reference page and want
> to compare it with all other pages, the fraction
> of words on your reference page common with all
> other pages depends on the length of these other
> pages.
>
> Anyway, I was not completely happy with it and
> if you read the follow-on article, you see a
> completely different 'correlation' statistic,
> which I preferred since it is less dependent on
> incidental spelling or transcription errors.
>
> In the end, in my opinion there are no clear
> C and D languages but there are transition
> areas. This makes it less likely (to me)
> that the VMs is from the hand of more than
> one person. The observed drift could be the
> result of a multitude of causes, linguistic
> or cryptographic, or even combined.
> While I can't prove it, the one thing it has
> convinced me of is that it identifies (roughly)
> the order in which the text on the pages was
> composed.
>
> Cheers, Rene
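
The normalisation quoted above (common-word count times a constant, divided by the square root of the product of the two page lengths) can be sketched in Python as follows. This is only a reading of that description, not Rene's actual code: it assumes "common words" means distinct word forms present on both pages, that page length is the token count, and it uses an arbitrary constant of 100. The second function shows the two-fraction alternative, which gives two different numbers when the page lengths differ.

```python
import math

def page_correlation(page_a, page_b, scale=100.0):
    """Single symmetric score for a pair of pages.

    Assumptions: 'common words' are distinct word forms on both pages,
    page length is the number of tokens, and `scale` is an arbitrary
    constant (the original value is not given in the quoted text).
    """
    if not page_a or not page_b:
        return 0.0
    common = len(set(page_a) & set(page_b))
    return scale * common / math.sqrt(len(page_a) * len(page_b))

def shared_fractions(page_a, page_b):
    """The two-number alternative: common forms over each page's length."""
    common = len(set(page_a) & set(page_b))
    return common / len(page_a), common / len(page_b)

# Toy pages: a short one (4 tokens) and a longer one (9 tokens).
short_page = ["daiin", "chol", "shol", "chor"]
long_page = ["daiin", "chol", "qokedy", "chedy", "okedy",
             "qokain", "shedy", "otedy", "lchedy"]
print(page_correlation(short_page, long_page))
print(shared_fractions(short_page, long_page))
```

The single score is the same whichever page is taken as the reference, while the two fractions depend on which page's length is in the denominator.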