# Re: Sukhotin's Algorithm

Folks, here is the promised write-up on "why Sukhotin can't read".
The bulk of it may be old news to most of you, but hopefully you will
find some interesting bits in the last couple of sections.

DIRECTED SIMILARITY

Say that two symbols X, Y are "next-similar" if they tend to be
followed by the same characters --- i.e., the probability distribution
of the next symbol after an X is similar to that of the next symbol
after any Y. Analogously, we can define "prev-similar" by looking at
the distributions of the characters right before occurrences of X and
of Y.

In either definition, we may use statistics computed over the text
tokens, or over the list of distinct words (the "lexicon"), ignoring
the frequency of each word in the text. Experience seems to indicate
that the second approach provides more consistent results, at least
for Voynichese (which may be one point where Voynichese differs from
natural languages).  So, in what follows, symbol frequencies
are computed over the lexicon by default.

I don't know the right way to compare two distributions
quantitatively, but this issue doesn't seem to be terribly important
for what follows. So let's just assume that we have agreed somehow on
numeric measures dnext(X, Y) and dprev(X, Y) of the distance between
the respective distributions, such that dnext(X, Y) = 0 when the
distribution after an X is identical to that after a Y; and similarly
for dprev().
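To make this concrete, here is one possible choice, sketched in
Python: total-variation distance between next-symbol distributions
computed over a word list. The function names and the choice of metric
are mine; nothing in the argument depends on them.

```python
from collections import Counter

def next_dist(lexicon, x):
    """Probability distribution of the symbol that follows x, computed
    over a word list; a word-final x counts a space as its successor."""
    counts = Counter()
    for w in lexicon:
        for i, c in enumerate(w):
            if c == x:
                counts[w[i + 1] if i + 1 < len(w) else ' '] += 1
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}

def dnext(lexicon, x, y):
    """Total-variation distance between the next-symbol distributions
    of x and y: 0 when they are identical, 1 when completely disjoint."""
    px, py = next_dist(lexicon, x), next_dist(lexicon, y)
    return 0.5 * sum(abs(px.get(s, 0.0) - py.get(s, 0.0))
                     for s in set(px) | set(py))
```

A matching dprev() would be the same computation with every word
reversed.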

SIMILARITY CLASSES

We can use either similarity measure to group the symbols into a
pre-specified number k of "classes", where all letters in the same
class are as similar as possible, and letters in different classes are
as dissimilar as possible. To find such a classification, we could use
any clustering algorithm, or just tabulate the probabilities and do an
approximate classification by hand.
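As a sketch of that clustering step, assuming some agreed symbol
distance dist() such as dnext(): a naive average-linkage agglomerative
pass that merges the two closest classes until only k remain. The
choice of linkage is mine, for illustration only.

```python
def cluster(symbols, dist, k):
    """Group symbols into k classes by naive average-linkage
    agglomerative clustering: start from singleton classes and
    repeatedly merge the two closest ones until only k remain."""
    classes = [[s] for s in symbols]

    def avg(a, b):
        # average pairwise distance between two classes
        return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(classes) > k:
        i, j = min(
            ((i, j) for i in range(len(classes))
                    for j in range(i + 1, len(classes))),
            key=lambda ij: avg(classes[ij[0]], classes[ij[1]]),
        )
        classes[i].extend(classes.pop(j))
    return classes
```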

Sukhotin's algorithm can be seen as an instance of those clustering
algorithms, specialized for the case k = 2. Granted, its goal function
f() probably can't be expressed in terms of dnext() and dprev().
However, to a very rough approximation, we can say that Sukhotin's
procedure tries to optimize some symmetric mixture of the two. That
is, for most languages, two symbols will end up in the same Sukhotin
class if their next- and previous-symbol distributions, taken
together, are sufficiently similar.
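For reference, here is the procedure itself, as I remember its
standard formulation: a symmetric adjacency-count matrix with zeroed
diagonal, and a greedy loop that promotes to "vowel" the symbol with
the largest remaining adjacency sum. Computed over the lexicon here,
per the convention above.

```python
from collections import defaultdict

def sukhotin(words):
    """Sukhotin's vowel finder (standard formulation, sketched from
    memory).  Builds a symmetric adjacency-count matrix over the word
    list, ignores the diagonal, then greedily promotes to 'vowel' the
    symbol whose remaining adjacency sum is largest and positive."""
    adj = defaultdict(lambda: defaultdict(int))
    for w in words:
        for a, b in zip(w, w[1:]):
            if a != b:                      # zeroed diagonal
                adj[a][b] += 1
                adj[b][a] += 1
    symbols = list(adj)
    score = {s: sum(adj[s].values()) for s in symbols}
    vowels = []
    while True:
        rest = [s for s in symbols if s not in vowels]
        if not rest:
            break
        s = max(rest, key=lambda x: score[x])
        if score[s] <= 0:
            break
        vowels.append(s)
        for x in symbols:                   # adjacency to a vowel now
            score[x] -= 2 * adj[x][s]       # counts against vowelhood
    return vowels
```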

The justification for Sukhotin's algorithm, viewed in this light, is
that we expect the discrepancies dnext() and dprev() to be much larger
between a vowel and a consonant than between two vowels or two
consonants --- if not for all languages, at least for vowel-rich
ones like Spanish and Italian. In particular, for two vowels X and Y,
the measure dnext(X, Y) will be lower than normal because both will
tend to be followed by the same subset of letters (the
consonants), and ditto for dprev(X, Y). Therefore, it should not
matter much whether we use dnext, dprev, or some combination of the
two: the strongest split should be the V-C one in any case.

VOYNICHESE SYMBOL CLASSES ARE WEIRD

Well, that isn't the case for Voynichese, at least not with the
standard alphabets. The first problem is that, with either metric, the
widest splits occur between classes that are too small to be vowels,
or are unlikely for other reasons. For instance, in the dnext()
metric, the EVA letters { <y>, <n>, <m> } are very similar, and very
different from the rest; however, that is simply because they
almost always occur in word-final position, so that the next
"character" is almost always a space. That is not how vowels (or
consonants, for that matter) behave in most European languages.
As another example, the four gallows letters { <k> <t> <p> <f> } also
constitute a well-defined prev-class; but they cannot be the vowels,
because there is at most one gallows in each word, and half of the
words don't have any gallows.

In fact, looking at the dnext() metric, one can readily see five or
so well-distinguished classes that occupy rather constrained positions
within each word. These constraints are the basis for the
crust-mantle-core model, and there doesn't seem to be anything like it
in European languages.

If the bizarre word structure were the only problem, we could get
around it by looking at more exotic languages. Indeed, standard
written Arabic has symbols that are constrained to occur only at the
beginning or at the end of a word. And if the "words" are actually
Chinese syllables with tones, we would indeed expect to see four or five
classes with fairly strict constraints on glyph order.  But...

VOYNICHESE SYMBOL CLASSES ARE INCONSISTENT

... then we have another big problem: the two metrics dnext() and
dprev() are quite different, and the classes defined by one cut across
the classes of the other.

For instance, looking at the /next-symbol/ distributions, we will find
that EVA <a> and <i> are fairly next-similar, because both tend to
be followed by a small set of letters (<i> chiefly, then <l>, <r>,
<n>, <m> in varying amounts). On the other hand, <a> and <y> are
extremely next-dissimilar, because <y> is hardly ever followed by
those letters. However, if we look the other way, we will see that <a>
and <y> are extremely prev-similar (preceded by space, <d>, <o>, <e>,
<ch>, <sh>, <l>, <r>, <s>, and the gallows), but very unlike <i>
(which can be preceded only by <a>, <i>, and <o>).

This inconsistency explains why Sukhotin's algorithm --- which, if you
recall, looks at both sides of each letter at once --- failed to
produce a convincing vowel/consonant split. Suppose we ask someone to
divide a deck of cards into exactly *two* piles, such that each pile
is as homogeneous as possible in *both* suit and value. The result
will necessarily be very inconsistent in both criteria: in the "low,
black" pile he will have to put many red-low and many black-high
cards.

And indeed, while Sukhotin's algorithm consistently classifies as vowels the
symbols <e>, <a>, <o>, <y> (which *are* indeed either prev-similar or
next-similar to each other, and play vowel-like roles in the layer
model), it also reports a few odd symbols like <s>, <p>, and <n>,
which cannot possibly fit in the same class as those.  Moreover,
as said above, <a> and <y> are similar on one side only, and quite
different on the other.  Change the relative weights of dnext() and dprev()
slightly, and the { <e>, <a>, <o>, <y> } class will fall apart.

SO WHAT IS NEW?

Those problems are probably old stuff. Anyone who has taken the time
to look at the next-letter and previous-letter tables must have
noticed the clear partition into the traditional letter classes
(gallows, dealers, etc.), but also must have become aware of (and
frustrated by) their peculiar positional constraints, and the
left-right inconsistency.  I myself have been chewing at that
bone since I discovered the VMS.

What I found out recently is that there seems to be some logic in the
left-right asymmetry after all. As you all know, most Voynichese
symbols are written with one or two pen strokes, which we can
label "left" and "right" in writing order. In particular, the dealers
and circles are the result of combining either an <e> or <i> stroke on
the left, with one of eight strokes (plumes, flourishes, ligature, or
empty) on the right. Indeed, all but one of the 16 possible
combinations are valid characters:

       |  0   1   2   3   4   5   6   7   8
  -----+-------------------------------------
    i  |  i   ii  *   l   r   j   m   n   C
    e  |  e   a   o   y   s   d   g   b   I

The four gallows are similarly composed of two left halves and two
right halves, in all combinations. (Benches and platform gallows are
combinations of three or more strokes; let's leave them aside for the
moment.)

Now, the funny thing is that the next-symbol and prev-symbol
distributions seem to be determined in large part by the symbol's
right stroke and left stroke, respectively. In other words, symbols on
the same row of the table tend to be prev-similar, while symbols in
the same column tend to be next-similar.
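The rule is mechanical enough to state as code. A toy sketch, with the
table above transcribed verbatim and my own (hypothetical) function
names:

```python
# The stroke table above, transcribed: row label = left stroke,
# column position = right-stroke choice; '*' is the invalid combination.
ROWS = {
    "i": ["i", "ii", "*", "l", "r", "j", "m", "n", "C"],
    "e": ["e", "a", "o", "y", "s", "d", "g", "b", "I"],
}

def strokes(glyph):
    """Return (left stroke, right-stroke column) for a glyph."""
    for left, row in ROWS.items():
        if glyph in row:
            return left, row.index(glyph)
    raise ValueError(glyph)

def predict_prev_similar(x, y):
    # same row, i.e. same left stroke -> similar prev-symbol distributions
    return strokes(x)[0] == strokes(y)[0]

def predict_next_similar(x, y):
    # same column, i.e. same right stroke -> similar next-symbol distributions
    return strokes(x)[1] == strokes(y)[1]
```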

Thus, for example, <a> and <y>, which have an <e> stroke on the left,
are fairly prev-similar but very next-different; while <a> and <i>,
which have an <i> stroke on the right, are very next-similar but
quite prev-different.

In fact, this rule seems to explain most of the inconsistencies and
tantalizing hints that used to drive me crazy, such as the
relationship between the dealers { <d> <l> <r> <s> }, the finals { <n>
<m> <g> }, and the circles { <a> <o> <y> }. Now it seems almost
logical, for example, that <l> and <y> can be prefixed to gallows, but
<r> and <s> can't; while, on the other hand, <d> and <s> may follow
<e>, while <l> and <r> may not. Also, it makes sense that all four
gallows <k> <t> <p> <f> are prev-similar, and yet <k> <t> are quite
next-different from <p> <f> (the latter are never followed by <e>).

To be sure, the above "stroke harmony" rule is not entirely
consistent. For instance, <n> and <m>, which both have an <i> stroke
on the left, are fairly prev-different --- <n> is almost always
preceded by <i>, while <m> (like <r> and <l>) is more often preceded
by <a>. But there is evidence anyway that <in> is a single letter, in
which case the inconsistency may well not exist. Also, <d> and <s> are
not as prev-similar to <a> as the theory would predict.

WHAT DOES IT MEAN?

There must be many theories that would explain this bizarre stroke
harmony rule. Here is only one of them. Suppose that instead of
writing the word "HEREBY" like this

#   #    #####    ####     #####    ####     #   #
#   #    #        #   #    #        #   #     # #
#####    ####     ####     ####     ####       #
#   #    #        #  #     #        #   #      #
#   #    #####    #   #    #####    ####       #

a scribe were to detach the last stroke of each letter
and attach it to the following letter, like this:

#   #    #####    ####     #####    ####     #   #
#   #    #        #   #    #        #   #     # #
#   # #######     ####     ####     ####      ##
#   #    #        #      # #        #          ##
#   #    #    #####       ##    #####       ####

Note that there is now a strong correlation between the last few
strokes of each "letter" and the first stroke of the next one, as in
Voynichese --- simply because they were originally parts of the same
letter. (For instance, after a symbol with an "F"-like right side, one
would often find one with a "foot" stroke on the left side.) Note also
that different letters become equal, and equal letters become
different. Finally, the tokens would continue to obey Zipf's law, but
all letter statistics would be fairly messed up.
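The re-cutting operation is easy to simulate. A toy sketch, with a
completely made-up stroke alphabet (the STROKES table below is
invented for illustration, not a claim about any real script):

```python
# Hypothetical stroke shapes, one string of stroke characters per letter.
STROKES = {"h": "||-", "e": "|=", "r": "|)", "b": "|o", "y": "v|"}

def encipher(word):
    """Detach the last stroke of each letter and attach it to the
    following letter: write out the stroke stream, then cut it one
    stroke before each original letter boundary."""
    stream = "".join(STROKES[ch] for ch in word)
    cuts, pos = [], 0
    for ch in word:
        pos += len(STROKES[ch])
        cuts.append(pos)
    # shift every boundary except the last one stroke to the left
    shifted = [c - 1 for c in cuts[:-1]] + [cuts[-1]]
    pieces, start = [], 0
    for c in shifted:
        pieces.append(stream[start:c])
        start = c
    return pieces
```

No strokes are lost, yet the cipher "letters" no longer line up with
the plaintext ones, and two cipher letters can come out equal even
when the plaintext letters behind them differ.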

Yet, no information has been lost; and, with a little practice, our
scribe could learn to write and read this "grammar-school cipher" as
quickly and naturally as plain text. However, even without realizing
it, he would have thrown a big monkey wrench into Sukhotin's algorithm
--- and, in fact, into most of our sophisticated statistical
methods...

I'd better stop here, and try to get some sleep. All the best,

--stolfi