Sukhotin algorithm for vowel recognition
Member bi3mw of the Voynich Ninja forum drew my attention to some Python code for vowel recognition, based on Sukhotin's algorithm. The author is @nezzcarth from the German Python forum.
https://www.voynich.ninja/thread-3901-p ... l#pid52879
With a view to applying the code to the Voynich manuscript, I selected the v101 transliteration. There were at least two issues to address, as follows:
• The online Python editor that I am using (at https://brython.info/tests/editor.html?lang=en) will accept up to about four pages of v101 text; whereas, since the Sukhotin algorithm is a statistical concept, it would be preferable to use longer sequences of text.
• The code apparently does not distinguish between upper and lower case in the text; whereas the v101 transliteration uses upper case for several common glyphs, notably those transliterated as A, C and H.
I therefore modified the v101 transliteration, by making the following replacements of keyboard assignments:
A => â, C => ĉ, E => ê, G => ĝ, H => ĥ, I => î, K => ķ, N => ń, S => ŝ, Y => ŷ, Z => ź.
This enabled the code to distinguish the common glyphs which have upper-case assignments in v101; those that remain (for example B, D, F, etc.) are relatively rare.
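For anyone who wants to reproduce the replacement step, here is a minimal sketch in Python. The mapping is the one listed above; the function name and the sample string are made up for illustration, and this is not part of @nezzcarth's code.

```python
# Mapping taken from the list in the post: upper-case v101 assignments
# are replaced by accented lower-case characters, so that code which
# folds case can still tell them apart from a, c, e, etc.
UPPER_TO_ACCENTED = str.maketrans({
    "A": "â", "C": "ĉ", "E": "ê", "G": "ĝ", "H": "ĥ", "I": "î",
    "K": "ķ", "N": "ń", "S": "ŝ", "Y": "ŷ", "Z": "ź",
})


def recode_upper_case(text: str) -> str:
    """Hypothetical helper: apply the replacements in a single pass."""
    return text.translate(UPPER_TO_ACCENTED)


sample = "oHAm.4oCc9"  # made-up string, not a real v101 line
print(recode_upper_case(sample))  # oĥâm.4oĉc9
```

Since the accented characters are already lower case, a later call to `lower()` leaves them unchanged, which is the point of the exercise.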
Last edited by DFS346 on Thu Oct 26, 2023 3:52 am, edited 4 times in total.
Re: Sukhotin algorithm for vowel recognition
Taking an example of a possible precursor language, the frequencies of the vowels in medieval Italian (as per the OVI corpus) are as follows:
E 12.7% (excluding È and É)
A 10.0%
I 9.4%
O 9.3%
U 2.8% (excluding U appearing as part of QU).
Last edited by DFS346 on Tue Oct 24, 2023 4:06 am, edited 1 time in total.
-
- Posts: 9
- Joined: Sun Apr 12, 2020 10:03 am
- Location: Quebec City, Canada,
Re: Sukhotin algorithm for vowel recognition
interesting.
I am currently working on an OCR that can read Voynichese directly from the hi-res pictures...
and I noticed that some symbols were transliterated as two or more chars when they could also have been one (or the reverse).
I had to make a choice (see picture): is the VMS letter one or three (CHC or H)? This choice probably accounts for the high number of consecutive similar letters (such as II as in 'aiin', which occurs a lot in the ZL_ivtff_2a transliteration, e.g. <f1r.3,+P0> syaiir.sheky.or.ykaiin.shod.cthoary.cthes.daraiin.sy).
This OCR obliged me to also work on a new transliteration, which will enable me to re-run all my tests and hopefully finally progress.
cheers
Re: Sukhotin algorithm for vowel recognition
MarcoP of the Voynich Ninja forum alerted me to the probability that @nezzcarth's code reads the space character (which is the most common character in the v101 transliteration) as a vowel.
If so, all glyphs which occur mainly in the initial position (such as 4o), or mainly in the final position (such as m, M, n and N) will be read as adjacent to a vowel and therefore identified as consonants.
I am not a programmer and cannot modify the code in order to ignore the spaces. However, I noticed that @nezzcarth's code ignores line breaks, or at least does not treat a line break as a character. I therefore experimented with replacing all spaces with line breaks.
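For readers who do want to sidestep the space problem in code, here is a minimal sketch of Sukhotin's algorithm as it is usually described. It is not @nezzcarth's code; the function name is hypothetical. It takes a list of "words", so adjacency is only counted inside a word and the space never enters the counts.

```python
from collections import defaultdict


def sukhotin_vowels(words):
    """Return the glyphs classified as vowels, in the order they are found.

    Sketch of the usual description of Sukhotin's algorithm; everything
    not returned is treated as a consonant.
    """
    # Symmetric adjacency counts; a glyph next to itself is not counted,
    # which keeps the diagonal of the matrix at zero.
    adj = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:
                adj[a][b] += 1
                adj[b][a] += 1

    # Row sums: every glyph starts out as a consonant.
    sums = {glyph: sum(row.values()) for glyph, row in adj.items()}

    vowels = []
    while sums:
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break  # no remaining consonant has a positive sum
        vowels.append(best)
        del sums[best]
        # Each remaining consonant loses twice its adjacency to the new vowel.
        for glyph in sums:
            sums[glyph] -= 2 * adj[glyph].get(best, 0)
    return vowels


print(sukhotin_vowels(["banana"]))  # ['a']
```

Ties between equal sums are broken here by whichever glyph was seen first, which is an arbitrary choice of this sketch.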
Last edited by DFS346 on Tue Oct 24, 2023 4:07 am, edited 1 time in total.
Re: Sukhotin algorithm for vowel recognition
This is to report on my renewed effort to run @nezzcarth's Python code for the Sukhotin algorithm, on individual pages of the v101 transliteration. I made the following modifications to the transliteration:
- replaced all of the upper-case keyboard assignments with Unicode accented lower-case characters, for example as follows:
A => â, C => ĉ, E => ê, G => ĝ, H => ĥ, I => î, K => ķ, N => ń, Y => ŷ, Z => ź.
- replaced all occurrences of 4o (which I believe is a single glyph) with the Unicode character ④
- replaced all spaces (.) and uncertain spaces (,) with line breaks.
(This was my attempt to ensure that the code would not necessarily treat predominantly initial glyphs, like ④, and predominantly final glyphs, like m, as consonants.)
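The last two modifications could be scripted roughly as follows. This is only a sketch: the function name and the sample line are invented, and it assumes the v101 text uses "." for spaces and "," for uncertain spaces, as described above.

```python
import re


def prepare_page(text: str) -> str:
    """Hypothetical helper: recode 4o and put one "word" on each line."""
    # Treat 4o as a single glyph (an assumption stated in the post).
    text = text.replace("4o", "④")
    # Split on spaces (.) and uncertain spaces (,), dropping empty pieces.
    words = [w for w in re.split(r"[.,]+", text) if w]
    return "\n".join(words)


line = "4oham.oe,4occ9"  # made-up string, not a real v101 line
print(prepare_page(line))
```

With one word per line, code that ignores line breaks no longer sees a space character next to the initial and final glyphs.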
Last edited by DFS346 on Wed Nov 08, 2023 2:36 pm, edited 6 times in total.
Re: Sukhotin algorithm for vowel recognition
Since I went down some false trails, I modified some of my earlier posts, and I take the liberty of linking to a revised summary which I posted on my GoodReads/Amazon blog:
https://www.goodreads.com/author_blog_p ... -algorithm
Re: Sukhotin algorithm for vowel recognition
Having run the Sukhotin algorithm on the first 120 pages of the Voynich manuscript, I provisionally identified the v101 glyphs o, a, 1, ④ and 2 as probable vowels. Here ④ stands for 4o, which I believe is a single glyph.
Working on the remaining pages.
Re: Sukhotin algorithm for vowel recognition
These are the results of further experiments with the Sukhotin algorithm for the identification of vowels in the Voynich manuscript. I used the algorithm as represented by @nezzcarth's Python code.
As a starting point, I used Glenn Claston’s v101 transliteration with the sole modification of replacing all occurrences of 4o (which I believe is a single glyph) with the Unicode character ④.
Further variations of the source document were as follows:
• v120: combining all variants of the glyph 2 (i.e. 2, 3, !, #, %, +)
• v121: treating all variants of the glyph 2 as variants of the glyph 1
• v130: combining all variants of the glyph 8 (i.e. 6, 7, 8, &)
• v140: combining variants of the glyph 9 (i.e. ( and 9)
• v150: treating C as a variant of cc, and deconstructing B, d and D as follows: B = cc£, d = ccc, D = ccc£
• v151: treating cc as a variant of C, and deconstructing B, d and D as follows: B = C£, d = cC, D = cC£
• v160: deconstructing the “bench” gallows glyphs as follows: F = f1, G = g1, etc.
• v161: deconstructing the “bench” gallows glyphs as follows: F = 1f, G = 1g, etc.
• v170: deconstructing the glyphs m, M and n into strings of i plus N
• v171: deconstructing the glyphs m, M and n into strings of i and I plus N
• v190: treating A as a variant of o.
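As an illustration of how such variants might be generated from the base transliteration, here is a minimal sketch covering four of them. The choice of 2, 8 and 9 as the glyph that the variants are merged into is my assumption, and the names `VARIANTS` and `apply_variant` are hypothetical.

```python
# Replacement tables for a few of the variants listed above.
# Keys are single v101 characters; values may be longer strings.
VARIANTS = {
    "v120": {g: "2" for g in "3!#%+"},   # merge variants of 2 (assumed target: 2)
    "v130": {g: "8" for g in "67&"},     # merge variants of 8 (assumed target: 8)
    "v140": {"(": "9"},                  # merge variants of 9 (assumed target: 9)
    "v150": {"C": "cc", "B": "cc£", "d": "ccc", "D": "ccc£"},
}


def apply_variant(text: str, name: str) -> str:
    """Apply one variant in a single pass, so replacements do not cascade."""
    return text.translate(str.maketrans(VARIANTS[name]))


print(apply_variant("o3a!", "v120"))   # o2a2
print(apply_variant("CoBdD", "v150"))  # ccocc£cccccc£
```

Using `str.translate` rather than chained `replace` calls matters for v150 and v151, where the output of one rule (for example cc) could otherwise be picked up by another.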
I am planning further runs of the Sukhotin algorithm with other variations of the v101 transliteration, for example combining x, X, y and Y, and deconstructing z and Z; and disaggregating the samples between Currier Languages A and B.
I am working on increasing the sample sizes. For the v101④ transliteration, I plan to test all 227 pages. For the other variations, I plan to use random samples. For a population of 227 pages, a random sample of 53 pages should yield results with a margin of error of plus or minus 10 percentage points at 90 percent confidence.
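One standard calculation that reproduces the figure of 53 is Cochran's sample-size formula with a finite population correction. Whether this is exactly the calculation behind the figure above is an assumption; the sketch below takes z = 1.645 for 90 percent confidence and the most conservative proportion p = 0.5.

```python
import math


def sample_size(population: int, z: float = 1.645,
                margin: float = 0.10, p: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population."""
    # Sample size for an unlimited population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(227))  # 53
```

Here n0 comes to about 67.7 pages, and the correction for a population of 227 brings it down to about 52.3, which rounds up to 53.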
In all of the variant transliterations, the algorithm identifies 1, o, ④ and a as probable vowels. Candidates for the next most probable vowels include C, 8, I and c.
Last edited by DFS346 on Sat Nov 11, 2023 4:20 am, edited 1 time in total.
Re: Sukhotin algorithm for vowel recognition
With regard to the ubiquitous v101 glyph 9, I have been wondering for some time whether it has the same meaning, or function, at the beginning of a "word" as at the end, or elsewhere. My tests with the Sukhotin algorithm suggest that it does not.
My approach was as follows.
I started with my v202 transliteration, which I view as a cleaned-up variant of v101. For example, it combines v101 glyphs such as 6, 7 and 8, which seem to be visually similar; like EVA, it deconstructs v101 glyphs such as m and n, which look as if they contain strings of i, I and N; and it treats all the variants of 2 (3, 5, !, etc.) as equivalent to 1 plus an accent of unknown significance.
I developed a variant which I called v203, in which the glyph 9 is represented by four Unicode characters:
- ₉ for occurrences at the beginning of a "word".
- 9 for occurrences in the interior of a "word"
- ⁹ for occurrences at the end of a "word"
- ⑨ for the glyph in isolation (i.e. a single-glyph "word").
The algorithm identified the variants of 9 as follows:
- initial 9 (1,674 occurrences): probable vowel (91% of sampled pages)
- interior 9 (444 occurrences): possible vowel (40%)
- final 9 (15,485 occurrences): probable consonant (100%)
- isolated 9 (235 occurrences): probable consonant (100%)
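The positional recoding described above could be done along these lines. This is a sketch only; the function name is hypothetical and the sample words are invented. The same function works for other glyphs by passing different arguments.

```python
def tag_by_position(word: str, glyph: str = "9", initial: str = "₉",
                    final: str = "⁹", isolated: str = "⑨") -> str:
    """Recode one glyph according to where it sits in the "word"."""
    if not word:
        return word
    if word == glyph:
        return isolated  # single-glyph "word"
    chars = list(word)
    if chars[0] == glyph:
        chars[0] = initial
    if chars[-1] == glyph:
        chars[-1] = final
    # Interior occurrences keep the plain character.
    return "".join(chars)


for w in ["9", "9o9", "o99", "a9c"]:  # made-up words
    print(w, "->", tag_by_position(w))
```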
Last edited by DFS346 on Sat Nov 11, 2023 4:24 am, edited 4 times in total.
Re: Sukhotin algorithm for vowel recognition
The v101 glyph o is found predominantly at the beginnings and in the interior of "words", and to a much lesser extent at the ends of "words". Is an initial o the same as a final o? Again, my tests with the Sukhotin algorithm suggest that it is not.
My v204 transliteration is a derivative of v202 which distinguishes variants of o according to the position of the glyph within the "word". I ran @nezzcarth's Python code for the Sukhotin algorithm on 53 randomly selected pages from v204.
The algorithm identified the variants of o as follows:
• initial o (8,866 occurrences): probable vowel (96% of sampled pages)
• interior o (10,039 occurrences): probable vowel (89%)
• final o (1,432 occurrences): probable consonant (100%)
• isolated o (210 occurrences): probable consonant (100%).
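For anyone wishing to repeat this kind of tally, here is a minimal sketch of the sampling and the percentage calculation. It assumes that the percentages are the share of sampled pages on which the glyph comes out as a vowel; the function names, the fixed seed and the toy data are all made up.

```python
import random


def pick_pages(page_ids, k=53, seed=0):
    """Draw k pages at random without replacement (seed fixed for repeatability)."""
    return sorted(random.Random(seed).sample(list(page_ids), k))


def vowel_share(page_vowels, glyph):
    """Percentage of pages on which `glyph` was classified as a vowel."""
    hits = sum(1 for vowels in page_vowels if glyph in vowels)
    return 100 * hits / len(page_vowels)


pages = pick_pages(range(1, 228))       # 227 pages, numbered 1 to 227
print(len(pages))                       # 53

# Toy data: the set of vowels found on each of four pages.
toy = [{"o"}, {"o", "a"}, {"a"}, {"o"}]
print(vowel_share(toy, "o"))            # 75.0
```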
Last edited by DFS346 on Sun Nov 12, 2023 3:31 pm, edited 1 time in total.