VMs: RE: Character Frequency Analysis

Thanks! That is one of the largest problems of a frequency count - what is the underlying glyph?

In this case "c" is usually the "c" in the "ch", "cPh", etc. glyphs, and most other appearances of a "c"-like character are listed as "e". But I think it would be constructive to try distinguishing between "ii" (which in several cases I believe is 'u') and "iin" and some others to see what the structure looks like then. The best I can say about the current stats is that they are consistent within themselves. The main idea I was working from in doing this was to see if the pages had a similar set of statistics, and it is apparent that this is not always so. Hopefully that might give a clue to someone who could take it and run.
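For anyone who wants to try the page-to-page comparison themselves, here is a rough sketch in Python. It is not the program I used. It assumes a transliteration where every non-space character is one glyph, and the distance measure (total variation distance) is simply one convenient choice. The function names are made up for illustration.

```python
from collections import Counter

def glyph_shares(page_text):
    """Return each glyph's share of all non-space characters on a page."""
    glyphs = [ch for ch in page_text if not ch.isspace()]
    total = len(glyphs)
    if total == 0:
        return {}
    return {g: n / total for g, n in Counter(glyphs).items()}

def distribution_distance(shares_a, shares_b):
    """Total variation distance between two glyph profiles.

    0.0 means identical profiles; 1.0 means no glyphs in common.
    """
    keys = set(shares_a) | set(shares_b)
    return 0.5 * sum(abs(shares_a.get(k, 0.0) - shares_b.get(k, 0.0)) for k in keys)
```

Pages with similar statistics should give a distance near 0, so a jump in the distance between consecutive pages would flag the kind of break discussed here.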

I'll go back to one of the more verbose pages and try your suggestions in various ways to see what shakes out.

Personally, I find the word ending stats particularly fascinating. There are relatively few glyphs that end words. Too few in my book. "y" has a huge, commanding lead. On one page it is over 70% of the word endings (though it generally averages around 30%).
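The word-ending figures can be reproduced with something like the following Python sketch. It assumes words are separated by whitespace and that the last character of each word is its final glyph; the function name is hypothetical and the sample string in the usage note is only a placeholder, not data from any page.

```python
from collections import Counter

def word_ending_shares(page_text):
    """Return the share of words ending in each glyph, most common first."""
    words = page_text.split()
    if not words:
        return {}
    endings = Counter(word[-1] for word in words)
    return {glyph: n / len(words) for glyph, n in endings.most_common()}
```

Calling word_ending_shares("daiin chedy qokeedy ol") on that placeholder string reports "y" as ending half of the four words.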

The other interesting thing is that the distribution curve fits very nicely for a monoalphabetic cipher, though we know it is very unlikely that this is the case (unless the underlying language is either unknown or invented). If it is monoalphabetic then there is a mixture of languages underlying the text (or the cipher changes between pages such as most of the f20s and f76). Whoever created this thing was a genius. Or incredibly mad. Maybe a Mad Genius (Simon Barsinister?) <grin>.

I have another 30 pages done and will post those with some of the suggestions you have made. I also want to run some samples of various other languages through the program to see what comes out. I would dearly love to find textual versions of old herbals I could cut-and-paste, but all I have found so far are page captures.

******************************
Larry Roux
Syracuse University
lroux@xxxxxxx
*******************************

>>> John@xxxxxxxxxxxx 08/23/03 03:06PM >>>

Well... there sure was a lot to read after my holidays!

However, I'll limit my response to Larry's work, which showed some interesting coincidences in folios 26 and 31 compared to the folios surrounding them. First, I think I may have asked this before, but what's the difference between a standalone 'c' and a standalone 'e'? The stats seem to show the popularity of each of these as separate... I think Vladimir is right that common constructs like 'ch', 'cph', 'cth', etc. should be counted separately, and that any standalone 'c' or 'e' should always be treated as 'e'. Then, I'd like to see a frequency count that includes the frequency of 'e', 'ee', 'eee', 'eeee', and 'i', 'ii', 'iii', 'iiii' as well. In the count, I think that when any character is repeated it should be counted as a whole; that is, 'eee' doesn't count as 'eee' and 'ee'+'e' and 'e'+'e'+'e'; it only counts as one occurrence of 'eee'.
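To make the counting rule concrete, here is a minimal Python sketch of one way it could be done. It is only an illustration under some assumptions: the transliteration is lowercase, the only compound glyphs are 'ch', 'cph' and 'cth', and only runs of 'e' and 'i' are grouped. The names TOKEN and count_tokens are hypothetical.

```python
import re
from collections import Counter

COMPOUNDS = ("cth", "cph", "ch")

# Order matters: compounds first, then runs of e (with any standalone c
# folded in), then runs of i, then any other single non-space character.
# The lookahead stops a run from swallowing the 'c' that starts a compound.
TOKEN = re.compile(r"cth|cph|ch|(?:e|c(?!th|ph|h))+|i+|\S")

def count_tokens(text):
    """Count compounds, whole e-runs, whole i-runs and single glyphs."""
    counts = Counter()
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token not in COMPOUNDS and token[0] in "ce":
            # Standalone 'c' is treated as 'e', so the run is all e's.
            token = "e" * len(token)
        counts[token] += 1
    return counts
```

With this, a word containing 'eee' adds one to the 'eee' count and nothing to 'ee' or 'e'. Runs never cross a space, since whitespace is not matched by any alternative.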

John.

-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On Behalf Of Larry Roux
Sent: Friday, August 22, 2003 11:54 PM
To: <
Subject: VMs: Character Frequency Analysis

I am a bit over half done with the page-by-page character frequency counts.

The (raw) data so far is posted to http://web.syr.edu/~laroux/VoyAnalysis.htm

I still find it weird that many consecutive pages have very similar statistics.

I wonder if this sort of thing can help confirm some of your suspicions that pages may be foliated in an incorrect order...

There are also some results from my program that I started putting in the page for characters that begin words and end words.

And of course my "English Text Hidden in the Voy" page is still there at http://web.syr.edu/~laroux/VoyEng.htm

One of these days I will get organized and make a neat page like all you'se guys (and gals) have, but ... one project at a time....

Larry

******************************
Larry Roux
Syracuse University
lroux@xxxxxxx
*******************************