Sukhotin algorithm for vowel recognition

DFS346 · Post by **DFS346** » Sat Nov 11, 2023 4:19 am

Here's a report on a larger sample from my v206, v207 and v210 transliterations. The v210 transliteration has four variants of the glyph 9, and four variants of the glyph o; v206 has variants of a; v207 has variants of 1. In all cases the variants are based on the position of the glyph in the word (initial, interior, final and isolated). I ran @nezzcarth’s Python code for the Sukhotin algorithm on randomly sampled pages of v206, v207 and v210.

The glyphs most frequently identified as probable vowels were as follows:
• initial and interior 1
• initial and interior o
• ④ = 4o
• initial 9 (but not final 9)
• initial and interior a
• C.

DFS346 · Post by **DFS346** » Mon Nov 13, 2023 2:58 pm

If the foregoing results from the Sukhotin algorithm have any merit, the logical next step is to try to match up the frequencies of the vowel-glyphs and consonant-glyphs with those of the vowels and consonants, respectively, in a possible precursor language.

As an example (and this is one of many possibilities), I have experimented with medieval Latin, as represented by Dante Alighieri’s De Monarchia, written between 1312 and 1313.

My idea was that the frequencies of the Voynich glyphs identified as vowels should align with the frequencies of the Latin vowels, which in De Monarchia are (in order of frequency) I, E, A, U and O. Here I was obliged to include the glyph 9 (in the final position) as a probable vowel, even though the Sukhotin algorithm does not identify it as such. The reason is that it is one of the few glyphs which are sufficiently frequent to be able to represent a Latin vowel.

Likewise, the frequencies of the Voynich glyphs identified as consonants should align with the frequencies of the Latin consonants, which in De Monarchia are T, S, N, R, M and so on.

Having done this, I found that the frequencies matched up quite well. Naturally we should not expect the alignment to be exact: for example, in De Monarchia N and R have similar frequencies, so their assignments might need to be swapped. There will probably also be some glyphs which are rare or illegible; my inclination is to ignore them.

Subject to these uncertainties, it should be possible to make a provisional transliteration of any page of the Voynich manuscript into medieval Latin. Perhaps the resulting text might include some recognisable Latin words.

DFS346 · Post by **DFS346** » Tue Nov 14, 2023 8:50 am

In my experiments with the Sukhotin algorithm, I adopted the working assumption that the “words” in the Voynich manuscript corresponded to words in the presumed precursor documents. However, it occurred to me that within each Voynich “word”, the glyphs might not necessarily follow the same order as the letters in the precursor word.

One possibility that I considered was that the Voynich “text” was written in the reverse order to that of the precursor text.

Intuitively, it seemed to me that the Sukhotin algorithm, since it operates on adjacent letters, should not care whether the letters and words were in the original order, or reversed. Therefore it should identify the same vowels in the reversed text as in the original text.

As a matter of due diligence, I ran @nezzcarth’s code for the Sukhotin algorithm on Abraham Lincoln’s Gettysburg Address, and on the same text reversed (for example, “four score and seven years ago” became “oga sraey neves dna erocs ruof”). The results were as follows:

• original text: vowels identified as A, B, E, H, I, O, P, U
• reversed text: vowels identified as A, E, G, I, K, O, U, Y.

In both cases the algorithm correctly identified the five English vowels A, E, I, O and U, and mis-identified three consonants (but not the same ones in the reversed text as in the original text).
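
For comparison, here is a minimal sketch of the textbook formulation of the Sukhotin algorithm, assuming symmetric adjacency counts taken within words only. It is not @nezzcarth's code; the function name `sukhotin_vowels` and the choice to treat every non-whitespace character as a symbol are assumptions made for illustration. Because the counts are symmetric, reversing the text leaves every count unchanged, so this particular version returns the same vowels for a text and its reversal. An implementation that gives different results may be counting something direction-dependent, such as word boundaries.

```python
from collections import defaultdict


def sukhotin_vowels(text):
    """Return the symbols classified as vowels, in the order they were picked.

    Assumption: words are separated by whitespace and every other
    character is one symbol. Strip punctuation before calling.
    """
    counts = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in text.split():
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal of the adjacency matrix stays zero
                counts[a][b] += 1
                counts[b][a] += 1

    # Row sums of the symmetric adjacency matrix.
    sums = {s: sum(counts[s].values()) for s in symbols}

    vowels = []
    consonants = set(symbols)  # every symbol starts out as a consonant
    while consonants:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(consonants), key=lambda s: sums[s])
        if sums[best] <= 0:
            break
        vowels.append(best)
        consonants.remove(best)
        # Adjacency to a known vowel counts against being a vowel.
        for s in consonants:
            sums[s] -= 2 * counts[best][s]
    return vowels


sample = "four score and seven years ago"
print(sukhotin_vowels(sample))
print(sukhotin_vowels(sample[::-1]))  # same list for this symmetric version
```

On a text this short the classification itself is unreliable; the point of the example is only that the two printed lists are identical.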

DFS346 · Post by **DFS346** » Tue Nov 14, 2023 11:07 am

I have run @nezzcarth’s code for the Sukhotin algorithm on a version of my v210 transliteration (which I designated v210R), in which the text of each sampled page is reversed. For example on folio f2r, where v210 includes the line “h98aiń⁹ ₉g1ôe 8aîń ók1ae ₉g1aîń h1ôes⁹”, in v210R this line becomes “⁹seô1h ńîa1g₉ ea1kó ńîa8 eô1g₉ ⁹ńia89h”.

Contrary to my expectations as to how the algorithm should work, the code identified a substantially different set of glyphs as probable vowels.

I had the impression that the code tended to identify initial glyphs as vowels, in preference to glyphs in interior or final positions. (In the reversed text, final glyphs become initial glyphs.)

If in this forum there are any Python coders, I would like to invite their ideas as to how to modify either the code, or the input text, to avoid any possible bias towards initial glyphs as vowels.

DFS346 · Post by **DFS346** » Wed Nov 15, 2023 3:26 pm

This is a first synthesis of my thoughts on the glyphs in the Voynich manuscript which are most likely to represent vowels in the presumed precursor documents.

https://www.goodreads.com/author_blog_p ... -revisited

DFS346 · Post by **DFS346** » Sat Nov 18, 2023 10:50 am

Having some reservations about @nezzcarth's Python code for the Sukhotin algorithm, I ran an alternative implementation developed by Dr Mans Hulden of the University of Colorado at Boulder (as slightly modified by MarcoP of the Voynich Ninja forum). I applied Dr Hulden's code to successively larger extracts from folio f1r, v211 transliteration, as follows:

• first line (original and reversed)
• first five lines (original and reversed)
• first ten lines (original and reversed)
• first twenty lines (original and reversed).

The results seem encouraging, in the following respects:

• the vowels identified remain substantially the same as the length of the extract increases
• the order of probability of the vowels remains substantially the same
• the results are the same for the reversed text as for the original text.

As a way of capturing the probability ranking of the vowels, I gave the following scores to each identified vowel:

• 10 for the most probable vowel
• 9 for the second most probable vowel
• and so on, down to 1 for the 10th most probable vowel, and 0.5 for all less probable vowels.
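
The scoring scheme can be sketched as follows. The function names are illustrative, and summing the scores across several extracts is an assumption about how the scores might be combined, not something stated above.

```python
from collections import Counter


def rank_scores(ranked_vowels, top=10, tail_score=0.5):
    """Score vowels listed most-probable-first: 10, 9, ..., 1, then 0.5."""
    scores = {}
    for position, glyph in enumerate(ranked_vowels):
        scores[glyph] = top - position if position < top else tail_score
    return scores


def total_scores(rankings):
    """Assumption: combine several extracts by summing per-glyph scores."""
    totals = Counter()
    for ranked_vowels in rankings:
        totals.update(rank_scores(ranked_vowels))
    return dict(totals)


# Hypothetical rankings, for illustration only (not results).
print(total_scores([["o", "a", "9"], ["a", "o", "c"]]))
```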

I am continuing with randomly sampled pages from the whole manuscript, in each case taking the whole page as the input file.

DFS346 · Post by **DFS346** » Thu Nov 23, 2023 11:31 am

We ran Dr Mans Hulden's code for the Sukhotin algorithm on the Voynich manuscript, v211 transliteration, separately on all pages in Language A, and on all pages in Language B. Most probable vowels (in order of probability):

Language A - interior o, interior a, interior 1, final 9, initial o;

Language B - interior a, c, final 9, interior o, initial o.

In the case of the v101 glyphs a, o, 1 and 9, we used Unicode characters to distinguish glyphs in the initial, interior, final and isolated positions. In the cases of a, o and 1, it is possible that all positions represent the same letter in the presumed precursor documents. However, the Sukhotin algorithm identified final 9 as a vowel and initial 9 as a consonant. In my mind, this lends weight to the hypothesis that final 9 and initial 9 do not represent the same precursor letter.
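
Positional tagging of this kind can be sketched as follows. This version appends a text label to each tracked glyph instead of substituting a Unicode character, so it produces a list of tokens per word; the function name, the labels and the default set of tracked glyphs are assumptions for illustration, not the actual v211 scheme. A Sukhotin implementation fed with this output would need to work on token lists, not on single characters.

```python
def tag_positions(word, tracked=frozenset("ao19")):
    """Return one token per glyph, labelling tracked glyphs by position."""
    tokens = []
    last = len(word) - 1
    for i, glyph in enumerate(word):
        if glyph not in tracked:
            tokens.append(glyph)  # untracked glyphs pass through unchanged
        elif last == 0:
            tokens.append(glyph + ":isolated")  # one-glyph word
        elif i == 0:
            tokens.append(glyph + ":initial")
        elif i == last:
            tokens.append(glyph + ":final")
        else:
            tokens.append(glyph + ":interior")
    return tokens


print(tag_positions("9ao9"))
```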

I think that this result also lends weight to the hypothesis that the precursor languages were, or included, abbreviated Latin or abbreviated Italian. As Adriano Cappelli documented, in these languages an abbreviation symbol resembling a 9 had different meanings in the initial and final positions.

DFS346 · Post by **DFS346** » Sat Nov 25, 2023 10:32 am

We ran Dr Mans Hulden's code for the Sukhotin algorithm on each of the themed sections of the Voynich manuscript, v211 transliteration.

The algorithm identified the following glyphs as the most probable vowels (in order of probability):

"herbal" pages: interior o, interior a, interior 1, final 9, initial o

"biological" pages: 8, h, e, ' (our catch-all symbol for the accents on 2 and its variants), k

"pharmaceutical" pages: interior o, interior a, h, final 9, '

"text-only" pages: interior a, c, interior o, final 9, initial o

"text with stars" pages: interior a, c, C, initial o, interior o

"astronomy" pages: interior o, interior a, final 9, h, k

"cosmology" pages: interior a, c, initial o, interior o, final 9

"zodiac" pages: interior a, c, initial o, interior o, final 9.

The results for the "herbal" pages are practically identical with those for the pages in Language A, as expected (since, as per Dr Lisa Fagin Davis, 76 percent of the lines in these pages are in Language A). The results for the "text-only", "text with stars", "cosmology" and "zodiac" pages closely resemble those for Language B.

Davis considered the "biological" pages to be entirely in Language B, but the vowel identification on these pages has no commonality with that for either Language A or Language B.

Davis considered the "pharmaceutical" pages to be entirely in Language A, and the "astronomy" pages to be in Language B, but the vowel identifications on these pages have some commonality with both Language A and Language B.

My first reading of these results is that if the "biological", "pharmaceutical" and "astronomy" pages have precursors in natural languages, those languages are not the precursors of Language A or B; or alternatively, that the mappings from those languages to Voynich glyphs are different from the mappings to Languages A and B.

DFS346 · Post by **DFS346** » Sat Nov 25, 2023 5:16 pm

In the light of our runs of the Sukhotin algorithm, I'm inclined to think that the text of the Voynich manuscript might be productively viewed as incorporating at least five languages, as follows:

• Language H: herbal section
• Language T: text-only, text with stars, "cosmology" (including the “rosettes” page), and zodiac sections
• Language C: biological or balneological section
• Language D: pharmaceutical or "recipes" section
• Language E: “astronomy” section (meaning the pages with representations of the sun, moon and stars).

In terms of vocabulary, Language H is statistically almost identical to Currier's Language A. Likewise, Language T is statistically very similar to Currier's Language B.

The concept of five languages can be tested as follows:
• calculating the frequencies of the glyphs in each section;
• matching vowel-glyphs (as identified by the Sukhotin algorithm) with vowels in selected potential precursor languages, on the basis of the frequency ranking;
• matching consonant-glyphs (as identified by the Sukhotin algorithm) with consonants in the selected precursor languages, on the basis of the frequency ranking;
• randomly selecting pages from each section and mapping glyphs to precursor letters;
• examining the results for any indication of meaningful words.
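
The matching and mapping steps in this list can be sketched as below, as a rough illustration only. The function names are hypothetical, the glyph rankings in the usage example are invented placeholders and not results, and pairing strictly by rank is the simplest possible assumption (near-ties in frequency may need to be swapped by hand).

```python
from collections import Counter


def rank_by_frequency(symbols):
    """Return distinct symbols ordered from most to least frequent."""
    counts = Counter(symbols)
    # Ties are broken alphabetically so the result is deterministic.
    return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def map_by_rank(glyph_ranking, letter_ranking):
    """Pair the n-th most frequent glyph with the n-th most frequent letter."""
    return dict(zip(glyph_ranking, letter_ranking))


def transliterate(text, mapping, unknown="?"):
    """Replace each glyph by its mapped letter; unmapped glyphs become '?'."""
    return "".join(
        ch if ch.isspace() else mapping.get(ch, unknown) for ch in text
    )


# Placeholder glyph rankings, for illustration only.
vowel_glyphs = ["o", "a", "1"]
consonant_glyphs = ["8", "h", "e"]

mapping = {}
mapping.update(map_by_rank(vowel_glyphs, "IEA"))      # Latin vowels by rank
mapping.update(map_by_rank(consonant_glyphs, "TSN"))  # Latin consonants by rank

print(transliterate("8oe h1k", mapping))
```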

We shall do some tests of this nature. As potential precursor languages, we will start with medieval Italian (as per the OVI corpus) and medieval Latin (as per Dante’s De Monarchia). More later.

DFS346 · Post by **DFS346** » Sun Nov 26, 2023 4:52 pm

In order to test whether we are really seeing different languages in the thematic sections, we assembled data on the counts of the 20 most frequent "words" in Currier Language A and B, and in each of the thematic sections.
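
A word count of this kind can be sketched as follows, assuming each section is available as a plain string of space-separated "words". The function names are illustrative, and using the overlap between two top-word lists as a similarity measure is an assumption for this example, not necessarily the comparison we made.

```python
from collections import Counter


def top_words(text, n=20):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(text.split()).most_common(n)


def shared_top_words(text_a, text_b, n=20):
    """Return the words that appear in the top n of both texts."""
    top_a = {word for word, _ in top_words(text_a, n)}
    top_b = {word for word, _ in top_words(text_b, n)}
    return top_a & top_b


# Tiny made-up inputs, for illustration only.
print(top_words("8am oe 8am 1oe 8am oe", n=2))
```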

Language H seems to be very close to Currier Language A. Language T is similar to Currier Language B. Languages C, D and E seem to be hybrid or different languages altogether.

In response to a comment from RenėZ on the Voynich Ninja Forum, we defined new versions of the herbal and pharmaceutical sections, in which we re-assigned folios f87, f90 and f93-96 from the herbal to the pharmaceutical section. The results are below. My reading of these results is that the presumed Language H is still very similar to Currier Language A, and that the presumed Language D is a hybrid or a different language from either Currier A or B.

[Attachment: 10 most frequent words by language and section.jpg]

Voynich Net Forum
