
*To*: voynich@xxxxxxxx
*Subject*: Re: Sukhotin's Algorithm
*From*: Jorge Stolfi <stolfi@xxxxxxxxxxxxx>
*Date*: Sat, 20 Jan 2001 04:44:58 -0200 (EDT)
*In-reply-to*: <3A687572.48267625@voynich.nu>
*References*: <482569D9.00059D4C.00@iss.nus.edu.sg> <3A67A016.CA33315C@alphalink.com.au> <3A686733.EE76386F@nctimes.net> <3A687008.C4B4490D@amu.edu.pl> <3A687572.48267625@voynich.nu>
*Reply-to*: stolfi@xxxxxxxxxxxxx

Folks, here is the promised write-up on "why Sukhotin can't read". The bulk of it may be old news to most of you, but hopefully you will find some interesting bits in the last couple of sections.

DIRECTED SIMILARITY

Say that two symbols X, Y are "next-similar" if they tend to be followed by the same characters --- i.e., the probability distribution of the next symbol after an X is similar to that of the next symbol after a Y. Analogously, we can define "prev-similar" by looking at the distributions of the characters right before occurrences of X and of Y.

In either definition, we may use statistics computed over the text tokens, or over the list of distinct words (the "lexicon"), ignoring the frequency of each word in the text. Experience seems to indicate that the second approach provides more consistent results, at least for Voynichese. (Which may be one point where Voynichese differs from natural languages.) So, in what follows, symbol frequencies are computed over the lexicon by default.

I don't know the right way to compare two distributions quantitatively, but this issue doesn't seem to be terribly important for what follows. So let's just assume that we have agreed somehow on numeric measures dnext(X, Y) and dprev(X, Y) of the distance between the respective distributions, such that dnext(X, Y) = 0 when the distribution after an X is identical to that after a Y; and similarly for dprev().

SIMILARITY CLASSES

We can use either similarity measure to group the symbols into a pre-specified number k of "classes", where all letters in the same class are as similar as possible, and letters in different classes are as dissimilar as possible. To find such a classification, we could use any clustering algorithm, or just tabulate the probabilities and do an approximate classification by hand. Sukhotin's algorithm can be seen as an instance of those clustering algorithms, specialized for the case k = 2.
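For concreteness, here is one way the lexicon-based dnext() measure could be computed. The text deliberately leaves the distance metric open, so the choice of total-variation (half-L1) distance below is my own assumption, as is the use of '#' as a word-boundary marker:

```python
# Sketch of the next-similarity measure dnext(X, Y): compare the
# next-symbol distributions of two symbols, computed over distinct
# words (each word counted once, per the lexicon convention above).
from collections import Counter

def next_distributions(lexicon):
    """Next-symbol distribution for each symbol; '#' marks word end."""
    counts = {}
    for word in set(lexicon):
        padded = word + "#"
        for x, nxt in zip(padded, padded[1:]):
            counts.setdefault(x, Counter())[nxt] += 1
    return {x: {s: n / sum(c.values()) for s, n in c.items()}
            for x, c in counts.items()}

def dnext(x, y, dists):
    """Total-variation distance; 0 when the distributions coincide."""
    keys = set(dists[x]) | set(dists[y])
    return 0.5 * sum(abs(dists[x].get(k, 0) - dists[y].get(k, 0))
                     for k in keys)
```

A dprev() measure would be the mirror image, tabulating the symbol before each occurrence instead.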
Granted, its goal function f() probably can't be expressed in terms of dnext() and dprev(). However, to a very rough approximation, we can say that Sukhotin's procedure tries to optimize some symmetric mixture of the two. That is, for most languages, two symbols will end up in the same Sukhotin class if their next- and previous-symbol distributions, taken together, are sufficiently similar.

The justification for Sukhotin's algorithm, viewed in this light, is that we expect the discrepancies dnext() and dprev() to be much larger between a vowel and a consonant than between two vowels or two consonants --- if not for all languages, at least for vowel-rich ones like Spanish and Italian. In particular, for two vowels X and Y, the measure dnext(X, Y) will be lower than normal because both will tend to be followed by the same subset of letters (the consonants), and ditto for dprev(X, Y). Therefore, it should not matter much whether we use dnext, dprev, or some combination of the two: the strongest split should be the V-C one in any case.

VOYNICHESE SYMBOL CLASSES ARE WEIRD

Well, that isn't the case for Voynichese, at least not with the standard alphabets. The first problem is that, with either metric, the widest splits occur between classes that are too small to be vowels, or are unlikely for other reasons. For instance, in the dnext() metric, the EVA letters { <y>, <n>, <m> } are very similar to each other, and very different from the rest; however, that is simply because they almost always occur in word-final position, so that the next "character" is almost always a space. That is not how vowels (or consonants, for that matter) behave in most European languages. As another example, the four gallows letters { <k> <t> <p> <f> } also constitute a well-defined prev-class; but they cannot be the vowels, because there is at most one gallows in each word, and half of the words don't have any gallows.
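For reference, here is a minimal sketch of Sukhotin's procedure itself, in its standard formulation: build symmetric adjacency counts with a zeroed diagonal, then repeatedly promote the symbol with the largest remaining sum to "vowel", subtracting twice its adjacency counts from the others, until no positive sum remains:

```python
# Sketch of Sukhotin's vowel-finding algorithm (standard formulation,
# not tied to any particular text or transcription alphabet).
from collections import defaultdict

def sukhotin(words):
    adj = defaultdict(int)    # symmetric adjacency counts
    sums = defaultdict(int)   # row sums
    for w in words:
        for x, y in zip(w, w[1:]):
            if x != y:        # diagonal (XX pairs) is zeroed out
                adj[x, y] += 1
                adj[y, x] += 1
                sums[x] += 1
                sums[y] += 1
    vowels = []
    remaining = set(sums)
    while remaining:
        v = max(remaining, key=lambda s: sums[s])
        if sums[v] <= 0:      # no candidate left with positive sum
            break
        vowels.append(v)
        remaining.remove(v)
        for c in remaining:   # vowels adjacent to v now count against c
            sums[c] -= 2 * adj[v, c]
    return vowels
```

The connection to the clustering view: maximizing vowel-consonant adjacencies is, roughly, grouping together symbols whose combined left-and-right contexts look alike.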
In fact, looking at the dnext() metric, one can readily see five or so well-distinguished classes, which occupy rather constrained positions within each word. These constraints are the basis for the crust-mantle-core model, and there doesn't seem to be anything like it in European languages.

If the bizarre word structure were the only problem, we could get around it by looking at more exotic languages. Indeed, standard written Arabic has symbols that are constrained to occur only at the beginning or at the end of a word. And if the "words" are actually Chinese syllables with tones, we would indeed expect to see four or five classes with fairly strict constraints on glyph order. But...

VOYNICHESE SYMBOL CLASSES ARE INCONSISTENT

... then we have another big problem: the two metrics dnext() and dprev() are quite different, and the classes defined by one cut across the classes of the other. For instance, looking at the /next-symbol/ distributions, we will find that EVA <a> and <i> are fairly next-similar, because they both tend to be followed by a small set of letters (<i> chiefly, then <l>, <r>, <n>, <m> in varying amounts). On the other hand, <a> and <y> are extremely next-dissimilar, because <y> is hardly ever followed by those letters. However, if we look the other way, we will see that <a> and <y> are extremely prev-similar (preceded by space, <d>, <o>, <e>, <ch>, <sh>, <l>, <r>, <s>, and the gallows), but very unlike <i> (which can be preceded only by <a>, <i>, and <o>).

This inconsistency explains why Sukhotin's algorithm --- which, if you recall, looks at both sides of each letter at once --- fails to produce a convincing split. Suppose we ask someone to divide a deck of cards into exactly *two* piles, such that each pile is as homogeneous as possible in *both* suit and value. The result will necessarily be very inconsistent on both criteria: in the "low, black" pile he will have to put many red-low and many black-high cards.
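The cross-cutting of the two metrics can be illustrated on a toy lexicon. The words below are synthetic, chosen only to mimic the <a>/<i>/<y> behaviour just described; they are not real Voynichese counts:

```python
# Toy illustration of next-similarity and prev-similarity cutting
# across each other (synthetic data, not actual VMS statistics).
lexicon = ["dan", "oar", "dy", "oy", "dain", "oair"]

def next_syms(s):
    """Set of symbols that follow s anywhere in the lexicon."""
    return {w[k + 1] for w in lexicon
            for k, c in enumerate(w[:-1]) if c == s}

def prev_syms(s):
    """Set of symbols that precede s anywhere in the lexicon."""
    return {w[k - 1] for w in lexicon
            for k, c in enumerate(w) if c == s and k > 0}

# 'a' and 'i' share followers, while 'y' shares none of them...
assert next_syms("a") & next_syms("i")
assert not next_syms("a") & next_syms("y")
# ...yet 'a' and 'y' share their preceders, and 'i' follows only 'a'.
assert prev_syms("a") == prev_syms("y")
assert prev_syms("i") == {"a"}
```

Any 2-way partition of {a, i, y} must break one of the two similarity relations, which is exactly the card-deck dilemma described above.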
And indeed, while Sukhotin's algorithm consistently classifies as vowels the symbols <e>, <a>, <o>, <y> (which *are* indeed either prev-similar or next-similar to each other, and play vowel-like roles in the layer model), it also reports a few odd symbols like <s>, <p>, and <n>, which cannot possibly fit in the same class as those. Moreover, as said above, <a> and <y> are similar on one side only, and quite unlike on the other. Change the relative weights of dnext() and dprev() slightly, and the { <e>, <a>, <o>, <y> } class will fall apart.

SO WHAT IS NEW?

Those problems are probably old stuff. Anyone who has taken the time to look at the next-letter and previous-letter tables must have noticed the clear partition into the traditional letter classes (gallows, dealers, etc.), but also must have become aware of (and frustrated by) their peculiar positional constraints, and the left-right inconsistency. I myself have been chewing at that bone since I discovered the VMS.

What I found out recently is that there seems to be some logic in the left-right asymmetry after all. As you all know, most Voynichese symbols are written with one or two pen strokes, which we can label "left" and "right" in writing order. In particular, the dealers and circles are the result of combining either an <e> or <i> stroke on the left with one of eight strokes (plumes, flourishes, ligature, or empty) on the right. Indeed, all but one of the 16 possible combinations are valid characters:

         0    1    2    3    4    5    6    7    8
      +---------------------------------------------
    i |  i   ii   *    l    r    j    m    n    C
    e |  e   a    o    y    s    d    g    b    I

The four gallows are similarly composed of two left halves and two right halves, in all combinations. (Benches and platform gallows are combinations of three or more strokes; let's leave them aside for the moment.)

Now, the funny thing is that the prev-symbol and next-symbol distributions seem to be determined in large part by the symbol's left stroke and right stroke, respectively.
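The stroke table can be turned directly into a lookup, with similarity predicates encoding the observation that sharing a row (left stroke) predicts prev-similarity and sharing a column (right stroke) predicts next-similarity. The column numbering is the one in the table; '*' marks the invalid combination:

```python
# The left/right stroke decomposition table, as a glyph lookup.
ROWS = {
    "i": ["i", "ii", "*", "l", "r", "j", "m", "n", "C"],
    "e": ["e", "a", "o", "y", "s", "d", "g", "b", "I"],
}

STROKES = {glyph: (left, col)
           for left, row in ROWS.items()
           for col, glyph in enumerate(row)
           if glyph != "*"}

def prev_similar(x, y):
    """Predicted prev-similar: same left stroke (same table row)."""
    return STROKES[x][0] == STROKES[y][0]

def next_similar(x, y):
    """Predicted next-similar: same right stroke (same table column)."""
    return STROKES[x][1] == STROKES[y][1]
```

So, for example, the table predicts <a> and <y> prev-similar (both on the <e> row) but not next-similar (different columns), matching the behaviour reported earlier.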
In other words, symbols on the same row of the table tend to be prev-similar, while symbols on the same column tend to be next-similar. Thus, for example, <a> and <y>, which have an <e> stroke on the left, are fairly prev-similar but very next-different; while <a> and <i>, which both have an <i> stroke on the right, are very next-similar but quite prev-different.

In fact, this rule seems to explain most of the inconsistencies and tantalizing hints that used to drive me crazy, such as the relationship between the dealers { <d> <l> <r> <s> }, the finals { <n> <m> <g> }, and the circles { <a> <o> <y> }. Now it seems almost logical, for example, that <l> and <y> can be prefixed to gallows, but <r> and <s> can't; while, on the other hand, <d> and <s> may follow <e>, while <l> and <r> may not. Also, it makes sense that all four gallows <k> <t> <p> <f> are prev-similar, and yet <k> <t> are quite next-different from <p> <f> (the latter are never followed by <e>).

To be sure, the above "stroke harmony" rule is not entirely consistent. For instance, <n> and <m>, which both have an <i> stroke on the left, are fairly prev-different --- <n> is almost always preceded by <i>, while <m> (like <r> and <l>) is more often preceded by <a>. But there is evidence anyway that <in> is a single letter, in which case the inconsistency may well not exist. Also, <d> and <s> are not as prev-similar to <a> as the theory would predict.

WHAT DOES IT MEAN?

There must be many theories that would explain this bizarre stroke harmony rule. Here is only one of them.
Suppose that instead of writing the word "HEREBY" like this

    #   #  #####  ####   #####  ####   #   #
    #   #  #      #   #  #      #   #   # #
    #####  ####   ####   ####   ####     #
    #   #  #      #  #   #      #   #    #
    #   #  #####  #   #  #####  ####     #

a scribe were to detach the last stroke of each letter and attach it to the following letter, like this:

    #   #    #####   ####    #####   ####    #   #
    #   # #  #     # #   # # #     # #   # #  # #
    #######  ####    ####    ####    ##
    #   # #  #      #  #   # #      #   ##
    #   #  # #####  ##      # #####  ####

Note that there is now a strong correlation between the last few strokes of each "letter" and the first stroke of the next one, as in Voynichese --- simply because they were originally parts of the same letter. (For instance, after a symbol with an "F"-like right side, one would often find one with a "foot" stroke on the left side.) Note also that different letters become equal, and equal letters become different. Note also that the tokens would continue to obey Zipf's law, but all letter statistics would be fairly messed up.

Yet, no information has been lost; and, with a little practice, our scribe could learn to write and read this "grammar-school cipher" as quickly and naturally as plain text. However, even without realizing it, he would have thrown a big monkey wrench into Sukhotin's algorithm --- and, in fact, into most of our sophisticated statistical methods...

I'd better stop here, and try to get some sleep.

All the best,

--stolfi
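This grammar-school cipher can be modelled at the stroke level. The stroke inventory below is invented purely for illustration (each plaintext letter stands for an arbitrary sequence of stroke symbols); the enciphering step is exactly the regrouping described above:

```python
# Toy model of the "grammar-school cipher": each letter is a stroke
# sequence; enciphering detaches the last stroke of every letter and
# prepends it to the following letter.  Hypothetical stroke names.
STROKES = {"h": ["|", "-", "|"], "e": ["|", "="], "r": ["|", "?"],
           "b": ["|", "3"], "y": ["\\", "/"]}

def encipher(word):
    groups = [list(STROKES[c]) for c in word]
    for k in range(len(groups) - 1, 0, -1):
        # move the previous letter's last stroke to the front of this one
        groups[k].insert(0, groups[k - 1].pop())
    return ["".join(g) for g in groups]
```

Running `encipher("hereby")` shows the effects claimed above: no stroke is lost (the process is trivially reversible by shifting the strokes back), yet distinct plaintext letters can yield identical cipher "letters", and per-letter statistics are scrambled.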

**References**:
- **Sukhotin's Algorithm**, *From:* Robert Firth
- **Re: Sukhotin's Algorithm**, *From:* Jacques Guy
- **Re: Sukhotin's Algorithm**, *From:* Mark Perakh
- **Re: Sukhotin's Algorithm**, *From:* Rafal T. Prinke
- **Re: Sukhotin's Algorithm**, *From:* Rene Zandbergen
