Sukhotin algorithm for vowel recognition

DFS346 · Post by **DFS346** » Fri Oct 20, 2023 1:22 pm

Member bi3mw of the Voynich Ninja forum drew my attention to some Python code for vowel recognition, based on Sukhotin's algorithm. The author is @nezzcarth from the German Python forum.
https://www.voynich.ninja/thread-3901-p ... l#pid52879

With a view to applying the code to the Voynich manuscript, I selected the v101 transliteration. There were at least two issues to address, as follows:
• The online Python editor that I am using (at https://brython.info/tests/editor.html?lang=en) accepts only about four pages of v101 text at a time, whereas the Sukhotin algorithm is statistical and would be better served by longer sequences of text.
• The code apparently does not distinguish between upper and lower case in the text, whereas the v101 transliteration uses upper case for several common glyphs, notably those transliterated as A, C and H.

I therefore modified the v101 transliteration, by making the following replacements of keyboard assignments:
A => â, C => ĉ, E => ê, G => ĝ, H => ĥ, I => î, K => ķ, N => ń, S => ŝ, Y => ŷ, Z => ź.
This enabled the code to distinguish the common glyphs which have upper-case assignments in v101; those that remain (for example B, D and F) are relatively rare.
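For anyone who prefers to script these replacements rather than do them by hand, the following is a minimal sketch. It only assumes the replacement list given above; the function name `protect_upper_case` is my own, and this is not part of @nezzcarth's code.

```python
# Replacement table taken from the list above: each upper-case v101 key is
# mapped to an accented lower-case character, so that a case-insensitive
# program still sees it as a separate symbol.
UPPER_TO_ACCENTED = str.maketrans({
    "A": "â", "C": "ĉ", "E": "ê", "G": "ĝ", "H": "ĥ", "I": "î",
    "K": "ķ", "N": "ń", "S": "ŝ", "Y": "ŷ", "Z": "ź",
})


def protect_upper_case(text: str) -> str:
    """Return text with the listed upper-case keys replaced (hypothetical helper)."""
    return text.translate(UPPER_TO_ACCENTED)


if __name__ == "__main__":
    # Upper-case keys outside the table (B, D, F, ...) are left unchanged.
    print(protect_upper_case("AbCdB"))
```

Because the replacement characters are already lower case, converting the result to lower case afterwards leaves them distinct from a, c, h and so on.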

DFS346 · Post by **DFS346** » Fri Oct 20, 2023 1:23 pm

Taking an example of a possible precursor language, the frequencies of the vowels in medieval Italian (as per the OVI corpus) are as follows:
E 12.7% (excluding È and É)
A 10.0%
I 9.4%
O 9.3%
U 2.8% (excluding U appearing as part of QU).

AquilaPausaLoquitur · Post by **AquilaPausaLoquitur** » Sat Oct 21, 2023 9:35 am

Interesting.
I am currently working on an OCR tool that can read Voynichese directly from the hi-res pictures, and I noticed that some symbols were transliterated as two or more characters when they could equally have been one (or vice versa).
I had to make a choice (see picture): is the VMS letter one or three (CHC or H)? This choice probably accounts for the high number of consecutive similar letters (such as II in 'aiin', which occurs a lot in the ZL_ivtff_2a transliteration, e.g. <f1r.3,+P0> syaiir.sheky.or.ykaiin.shod.cthoary.cthes.daraiin.sy).

This OCR work also obliged me to produce a new transliteration, which will enable me to re-run all my tests and, hopefully, finally make progress.

cheers

[Attachment: OCR workbench; Screenshot 2023-10-21.jpg]

DFS346 · Post by **DFS346** » Sat Oct 21, 2023 4:47 pm

MarcoP of the Voynich Ninja forum pointed out that @nezzcarth's code probably reads the space character (which is the most common character in the v101 transliteration) as a vowel.

If so, all glyphs which occur mainly in the initial position (such as 4o), or mainly in the final position (such as m, M, n and N) will be read as adjacent to a vowel and therefore identified as consonants.

I am not a programmer and cannot modify the code in order to ignore the spaces. However, I noticed that @nezzcarth's code ignores line breaks, or at least does not treat a line break as a character. I therefore experimented with replacing all spaces with line breaks.
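To illustrate what ignoring the spaces could look like, here is a minimal sketch of Sukhotin's algorithm as it is usually described: count how often each pair of different symbols occurs side by side, then repeatedly promote the symbol with the largest remaining adjacency sum to vowel. This is not @nezzcarth's code, which I have not reproduced; the function name `sukhotin_vowels` is my own. Adjacency is counted only inside each word, so a word separator can never be classified as a vowel or a consonant.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the symbols classified as vowels, in the order they were chosen."""
    adjacency = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        # Count neighbouring pairs inside the word only; pairs of identical
        # symbols are skipped, which keeps the matrix diagonal at zero.
        for left, right in zip(word, word[1:]):
            if left != right:
                adjacency[left][right] += 1
                adjacency[right][left] += 1

    # Every symbol starts as a consonant with its total adjacency count.
    sums = {s: sum(adjacency[s].values()) for s in symbols}
    consonants = set(symbols)
    vowels = []
    while consonants:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(consonants), key=lambda s: sums[s])
        if sums[best] <= 0:
            break
        vowels.append(best)
        consonants.remove(best)
        # Neighbours of a new vowel become less likely to be vowels themselves.
        for c in consonants:
            sums[c] -= 2 * adjacency[c][best]
    return vowels


if __name__ == "__main__":
    # str.split() drops the spaces, so they never enter the counts.
    print(sukhotin_vowels("baba dada bada".split()))
```

With this layout, any separator (space, full stop, comma or line break) can be handled simply by splitting the text into words before calling the function.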

DFS346 · Post by **DFS346** » Mon Oct 23, 2023 11:48 am

This is to report on my renewed effort to run @nezzcarth's Python code for the Sukhotin algorithm, on individual pages of the v101 transliteration. I made the following modifications to the transliteration:

• replaced all of the upper-case keyboard assignments with Unicode accented lower-case characters, for example as follows:
A => â, C => ĉ, E => ê, G => ĝ, H => ĥ, I => î, K => ķ, N => ń, Y => ŷ, Z => ź.
• replaced all occurrences of 4o (which I believe is a single glyph) with the Unicode character ④
• replaced all spaces (.) and uncertain spaces (,) with line breaks. (This was my attempt to ensure that the code would not necessarily treat predominantly initial glyphs, like ④, and predominantly final glyphs, like m, as consonants.)
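The three modifications above can be sketched in a few lines of Python. This is only an illustration of the steps as I have described them; the function name `prepare_page` and the sample string are hypothetical, and the replacement table is the example list given in this post.

```python
# Example replacement list from this post (upper-case key -> accented lower case).
ACCENTED = str.maketrans({
    "A": "â", "C": "ĉ", "E": "ê", "G": "ĝ", "H": "ĥ",
    "I": "î", "K": "ķ", "N": "ń", "Y": "ŷ", "Z": "ź",
})


def prepare_page(text: str) -> str:
    """Apply the three modifications to one page of v101 text (hypothetical helper)."""
    # 1. Treat 4o as a single glyph, written as ④.
    text = text.replace("4o", "④")
    # 2. Replace the upper-case keys so that case folding cannot merge them.
    text = text.translate(ACCENTED)
    # 3. Turn spaces (.) and uncertain spaces (,) into line breaks.
    return text.replace(".", "\n").replace(",", "\n")


if __name__ == "__main__":
    # Made-up sample line, not a quotation from the manuscript.
    print(prepare_page("4oHc9.oe,1Am"))
```

The 4o replacement is done first so that the later steps see ④ as one character.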

DFS346 · Post by **DFS346** » Tue Oct 24, 2023 4:39 am

Since I went down some false trails, I modified some of my earlier posts, and I take the liberty of linking to a revised summary which I posted on my GoodReads/Amazon blog:

https://www.goodreads.com/author_blog_p ... -algorithm

DFS346 · Post by **DFS346** » Sun Oct 29, 2023 2:47 pm

Having run the Sukhotin algorithm on the first 120 pages of the Voynich manuscript, I provisionally identified the v101 glyphs o, a, 1, ④ and 2 as probable vowels. Here ④ stands for 4o, which I believe is a single glyph.

Working on the remaining pages.

DFS346 · Post by **DFS346** » Mon Nov 06, 2023 12:53 pm

These are the results of further experiments with the Sukhotin algorithm for the identification of vowels in the Voynich manuscript. I used the algorithm as represented by @nezzcarth's Python code.

As a starting point, I used Glenn Claston’s v101 transliteration with the sole modification of replacing all occurrences of 4o (which I believe is a single glyph) with the Unicode character ④.

Further variations of the source document were as follows:
• v120: combining all variants of the glyph 2 (i.e. 2, 3, !, #, %, +)
• v121: treating all variants of the glyph 2 as variants of the glyph 1
• v130: combining all variants of the glyph 8 (i.e. 6, 7, 8, &)
• v140: combining variants of the glyph 9 (i.e. (, 9)
• v150: treating C as a variant of cc, and deconstructing B, d and D as follows: B = cc£, d = ccc, D = ccc£
• v151: treating cc as a variant of C, and deconstructing B, d and D as follows: B = C£, d = cC, D = cC£
• v160: deconstructing the “bench” gallows glyphs as follows: F = f1, G = g1, etc.
• v161: deconstructing the “bench” gallows glyphs as follows: F = 1f, G = 1g, etc.
• v170: deconstructing the glyphs m, M and n into strings of i plus N
• v171: deconstructing the glyphs m, M and n into strings of i and I plus N
• v190: treating A as a variant of o.
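Several of the variants above are plain character substitutions, so they could be generated from one base file with a small table per variant. The sketch below covers only four of them and reflects my own reading of the list (for example, v150 is taken to mean that C is rewritten as cc); the names `VARIANTS` and `apply_variant` are hypothetical.

```python
# One substitution table per variant; keys are single v101 characters.
VARIANTS = {
    "v130": {"6": "8", "7": "8", "&": "8"},                    # variants of 8
    "v140": {"(": "9"},                                        # variants of 9
    "v150": {"C": "cc", "B": "cc£", "d": "ccc", "D": "ccc£"},  # C read as cc
    "v190": {"A": "o"},                                        # A as a variant of o
}


def apply_variant(text: str, name: str) -> str:
    """Return text rewritten according to the named variant (hypothetical helper)."""
    # str.maketrans accepts single-character keys mapped to strings of any length.
    return text.translate(str.maketrans(VARIANTS[name]))


if __name__ == "__main__":
    print(apply_variant("68&7", "v130"))
    print(apply_variant("oCBdD", "v150"))
```

Variants such as v160 and v170, which depend on a whole family of glyphs, would need longer tables but could follow the same pattern.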

I am planning further runs of the Sukhotin algorithm with other variations of the v101 transliteration, for example combining x, X, y and Y, and deconstructing z and Z; and disaggregating the samples between Currier Language A and B.

I am working on increasing the sample sizes. For the v101④ transliteration, I plan to test all 227 pages. For the other variations, I plan to use random samples. For a population of 227 pages, a random sample of 53 pages should yield results with a margin of error of plus or minus 10 percent at 90 percent confidence.
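The figure of 53 pages can be reproduced with the standard sample-size formula for a proportion, with a finite-population correction. The sketch below assumes that formula and the most conservative proportion (p = 0.5); these are assumptions made for illustration, and the function name `sample_size` is my own.

```python
import math
from statistics import NormalDist


def sample_size(population, margin=0.10, confidence=0.90, p=0.5):
    """Pages to sample for the given margin of error and confidence level."""
    # Two-sided critical value of the normal distribution (about 1.645 for 90%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an unlimited population.
    n0 = z * z * p * (1 - p) / margin ** 2
    # Finite-population correction, then round up to whole pages.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


if __name__ == "__main__":
    print(sample_size(227))  # 53 under these assumptions
```

Tightening the margin to 5 percent, or raising the confidence level, increases the required number of pages considerably, which is why the full 227 pages are preferable where feasible.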

In all of the variant transliterations, the algorithm identifies 1, o, ④ and a as probable vowels. Candidates for the next most probable vowels include C, 8, I and c.

DFS346 · Post by **DFS346** » Thu Nov 09, 2023 2:31 pm

With regard to the ubiquitous v101 glyph 9, I have been wondering for some time whether it has the same meaning, or function, at the beginning of a "word" as at the end, or elsewhere. My tests with the Sukhotin algorithm suggest that it does not.

My approach was as follows.

I started with my v202 transliteration, which I view as a cleaned-up variant of v101: for example, it combines v101 glyphs such as 6, 7 and 8 which seem to be visually similar; (like EVA) it deconstructs v101 glyphs such as m and n that look like they contain strings of i, I and N; and it treats all the variants of 2 (3, 5, ! etc) as equivalent to 1 plus an accent of unknown significance.

I developed a variant which I called v203, in which the glyph 9 is represented by four Unicode characters:

• ₉ for occurrences at the beginning of a "word"
• 9 for occurrences in the interior of a "word"
• ⁹ for occurrences at the end of a "word"
• ⑨ for the glyph in isolation (i.e. a single-glyph "word").
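The positional marking described above could be automated along the following lines. This is a sketch only, applied word by word; the function name `tag_nine` is hypothetical and the sample words are made up.

```python
def tag_nine(word: str) -> str:
    """Mark each glyph 9 according to its position in the word."""
    if word == "9":
        return "⑨"  # the glyph in isolation
    chars = list(word)
    last = len(chars) - 1
    for i, ch in enumerate(chars):
        if ch != "9":
            continue
        if i == 0:
            chars[i] = "₉"  # beginning of the word
        elif i == last:
            chars[i] = "⁹"  # end of the word
        # an interior 9 is left as the plain digit
    return "".join(chars)


if __name__ == "__main__":
    # Made-up sample words, not quotations from the manuscript.
    print([tag_nine(w) for w in ["9", "9oe9", "o9e"]])
```

The same approach would work for any other glyph, given four distinct Unicode characters for the four positions.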

I ran @nezzcarth's Python code for the Sukhotin algorithm on 53 pages from the v203 transliteration, randomly selected using the Excel RAND() function. This sample size should yield results with an uncertainty of plus or minus 10 percent, with 90 percent confidence. I confirmed that the code distinguished the four Unicode characters for the four positions of the glyph 9.
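For readers without Excel, an equivalent random selection of pages can be made in Python. Note that this swaps in Python's `random.sample` for the Excel RAND() function that I actually used; the function name `sample_pages` and the numbering of pages from 1 to 227 are assumptions made for the sketch.

```python
import random


def sample_pages(total_pages=227, sample_size=53, seed=None):
    """Return a sorted random selection of page numbers, without repeats."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(range(1, total_pages + 1), sample_size))


if __name__ == "__main__":
    print(sample_pages(seed=2023))
```

Recording the seed alongside the results would allow the same selection of pages to be re-run against a different transliteration.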

The algorithm identified the variants of 9 as follows:

• initial 9 (1,674 occurrences): probable vowel (91% of sampled pages)
• interior 9 (444 occurrences): possible vowel (40%)
• final 9 (15,485 occurrences): probable consonant (100%)
• isolated 9 (235 occurrences): probable consonant (100%).

DFS346 · Post by **DFS346** » Fri Nov 10, 2023 7:13 am

The v101 glyph o is found predominantly at the beginnings and in the interior of "words", and to a much lesser extent at the ends of "words". Is an initial o the same as a final o? Again, my tests with the Sukhotin algorithm suggest that it is not.

My v204 transliteration is a derivative of v202 which distinguishes variants of o according to the position of the glyph within the "word". I ran @nezzcarth's Python code for the Sukhotin algorithm on 53 randomly selected pages from v204.

The algorithm identified the variants of o as follows:
• initial o (8,866 occurrences): probable vowel (96% of sampled pages)
• interior o (10,039 occurrences): probable vowel (89%)
• final o (1,432 occurrences): probable consonant (100%)
• isolated o (210 occurrences): probable consonant (100%).

Voynich Net Forum
