
VMs: VMS: more past stuff: "On some properties of informative texts"



Again an article by B. V. Sukhotin:


         ON SOME PROPERTIES OF INFORMATIVE TEXTS

Let us call "informative" those texts which are meant to be deciphered.

Note that this concept is different from that of "meaning" proper,
for a text which carries a meaning is not necessarily meant to be
deciphered without a key. Consequently, meaningless and undecipherable
texts, by definition, cannot be informative.

Since encryption is relatively rare, a characteristic feature of texts
in a natural language will be their high degree of INFORMATIVENESS.
I shall not attempt to formalize this concept any further, as I only
wish to suggest directions for further research.

Take for instance  a problem very similar to the partitioning of a
text into its constituent morphemes. Imagine a text encoded using
n-tuples of integers, so that each letter corresponds to one n-tuple.
Suppose n known. The problem is: find the beginning of the n-tuples.

It is obvious that (supposing the text circular or its beginning and
end missing, so that the solution is not trivial) there are only
n classes of starting positions, since each starting point is
equivalent to the one obtained moving nm positions left or right
(m being an integer). Thus the set of acceptable solutions is
defined: it is the set of partitions of the text into n-tuples,
and this set has n elements.
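This set of acceptable solutions can be sketched as follows (in Python; an illustration, not part of Sukhotin's text): each of the n starting classes yields one partition of the circular text into n-tuples.

```python
def partitions(text, n):
    """Return the n acceptable solutions: for each starting offset
    0..n-1, the list of n-tuples read around the circular text."""
    assert len(text) % n == 0, "circular text length must be a multiple of n"
    result = []
    for offset in range(n):
        rotated = text[offset:] + text[:offset]  # close the circle
        result.append([rotated[i:i + n] for i in range(0, len(rotated), n)])
    return result

# Example: a 6-symbol circular text partitioned into 3-tuples.
print(partitions("010111", 3))
# -> [['010', '111'], ['101', '110'], ['011', '101']]
```

Any further shift by a multiple of n reproduces one of these n partitions, which is why the solution set has exactly n elements.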

How can the quality of such a partition be judged?

A random text can be considered to be a sequence of symbols produced
by randomly drawing symbols from a box. The probabilities of the
different symbols must be considered equal and so must be those of the
corresponding n-tuples (if they were not, there would be less reason
to consider the text to be completely random). It seems obvious that an
informative text must differ considerably from a random text.
Several functions can be proposed to evaluate its degree of
randomness, for instance, the sum of the absolute deviations
from the mean probability, the sum of their squares, or the
first-order entropy of the text, or its infinite-order entropy.

Choosing between those functions is difficult, even though infinite-order
entropy seems best. At any rate, since they are probably all roughly
equivalent for the use to which we are about to put them, the simplest
one will be used here.
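The candidate functions might be sketched as follows (Python; my illustration, not Sukhotin's). Here `counts` is assumed to be the list of frequencies over all possible n-tuples, zeros included:

```python
import math

def abs_deviation_sum(counts):
    """Sum of absolute deviations of the tuple frequencies from the mean."""
    mean = sum(counts) / len(counts)
    return sum(abs(c - mean) for c in counts)

def squared_deviation_sum(counts):
    """Sum of squared deviations from the mean frequency."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts)

def first_order_entropy(counts):
    """First-order entropy of the tuple distribution, in bits."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A uniform distribution deviates least; a concentrated one most.
print(abs_deviation_sum([2, 2, 2]), abs_deviation_sum([6, 0, 0]))  # 0.0 8.0
```

Note the opposite signs of the criteria: deviation sums are largest, and entropy smallest, for the least random distribution.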

Let us encode the Roman alphabet (here the classical 24 letters, without
j or w) into ternary 3-tuples, i.e.:
a=000, b=001, c=002, d=010, e=011, f=012, g=020, etc...

The short Latin text "Donec eris felix, multos numerabis amicos,
tempora si fuerint nubila, solus eris" becomes:

010111110011002011121022122012011101022210102201101200111122
110201102011121000001022122000102022002111122200011102112111
121000122022012201011121022110200110201001022101000122111101
201122011121022122
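A sketch of this encoding in Python (my reconstruction, not the author's). Reproducing the digits above requires the classical 24-letter alphabet, so that e.g. o = 13 = 111 and x = 21 = 210 in ternary:

```python
# Classical Roman alphabet: no j, no w (24 letters). This is the
# ordering that reproduces the digit string above.
ALPHABET = "abcdefghiklmnopqrstuvxyz"

def encode(text):
    """Encode each letter as a 3-digit ternary number; skip punctuation."""
    digits = []
    for ch in text.lower():
        if ch in ALPHABET:
            i = ALPHABET.index(ch)
            digits.append(f"{i // 9}{i // 3 % 3}{i % 3}")
    return "".join(digits)

latin = ("Donec eris felix, multos numerabis amicos, "
         "tempora si fuerint nubila, solus eris")
print(encode(latin)[:15])  # -> 010111110011002  ("donec")
```

The 66 letters of the Latin text yield 198 digits, i.e. 66 3-tuples in the correct partition.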

We have:

            Number of 3-tuples occurring N times in partition

             N      P1     P2     P3

             0      10      7      6
             1       3      6      5
             2       3      5      5
             3       1      3      4
             4       4      0      0
             5       3      2      2
             6       0      1      2
             7       1      2      1
             8       2      0      1
             9       0      0      0
            10       0      0      0
            11       0      1      0

The objective function, here the sum of the absolute deviations from
the mean frequency, should be very small for a truly random text.
Now, since an erroneous partition should resemble a random text more
than a correct partition should, the maximum value of the objective
function should correspond to the correct partition.

In fact, the sum of the absolute deviations from the mean frequency is
60 for partition P1, 53 for partition P2, and 49 for partition P3, and the
maximum sum, 60, does correspond to the correct partition, P1.
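This computation can be sketched in Python (my illustration; the digit string is regenerated from the Latin text, so small discrepancies with the figures quoted above are possible):

```python
from collections import Counter

# Ternary encoding of the Latin example (24-letter Roman alphabet).
S = ("010111110011002011121022122012011101022210102201101200111122"
     "110201102011121000001022122000102022002111122200011102112111"
     "121000122022012201011121022110200110201001022101000122111101"
     "201122011121022122")

def abs_deviation_sum(text, offset, n=3):
    """Sum of absolute deviations of the 3-tuple counts from the mean,
    over all 27 possible tuples (the text is treated as circular)."""
    rotated = text[offset:] + text[:offset]
    counts = Counter(rotated[i:i + n] for i in range(0, len(rotated), n))
    n_types = 3 ** n  # 27 possible ternary 3-tuples
    mean = len(rotated) / n / n_types
    observed = sum(abs(c - mean) for c in counts.values())
    unobserved = (n_types - len(counts)) * mean  # tuples occurring 0 times
    return observed + unobserved

devs = [abs_deviation_sum(S, k) for k in range(3)]
print([round(d, 1) for d in devs])  # the correct partition (offset 0) is maximal
```

Note that with the mean frequency 66/27, the value for P1 comes out near 60, as stated, and remains the maximum of the three.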

But the four possible objective functions suggested above take
an extreme value when symbols are cyclically repeated throughout
the text. Such texts are clearly not what our intuition sees as
informative. Therefore, none of these functions provides a measure
of the informativeness of texts.

Assume now that a mostly random text is given. Whichever two possible
partitions are considered, the corresponding values of the objective
function (whether it is the sum of the absolute deviations, of their
square, or the entropy of any order) can be expected to be very close.

If, on the other hand, a mostly non-random text is given, the values
of these same functions calculated on any two partitions will be just
as close as for a random text.

The above observations suggest that, when a text which is not
informative is encoded as in our example, the difference of quality
between the best and the worst partition can only be vanishingly small.

Consider now a text somehow encoded.

The resulting encoded text can be considered to be one particular
interpretation of the original.

Assume that some objective function has been defined by which
the quality of each possible interpretation can be evaluated.

Then it is possible to produce a "very good" interpretation of an
informative text, as well as a "very bad" one.

Now, it seems intuitively true that an informative text can be either
"well" understood or "badly" understood. On the other hand, a random
text can only be "badly" understood, and an obvious text (a periodic
sequence of symbols, for instance) can only be "well" understood. Here,
"understanding" means being able to predict what follows on the evidence
of some part of the text.

If so, it would seem natural to try and evaluate to what degree a text
is informative from the difference of quality between its best and its
worst interpretation:

  Q(Informativeness) = Q(Best interpretation) - Q(Worst interpretation)
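Taking the earlier deviation sum as the quality function Q, this difference can be sketched for the extreme case of an obvious (periodic) text (Python; my illustration, not Sukhotin's):

```python
from collections import Counter

def abs_deviation_sum(text, offset, n=3):
    """Quality of one partition: sum of absolute deviations of the
    3-tuple counts from the mean, over all 27 possible tuples."""
    rotated = text[offset:] + text[:offset]
    counts = Counter(rotated[i:i + n] for i in range(0, len(rotated), n))
    mean = len(rotated) / n / 27
    return (sum(abs(c - mean) for c in counts.values())
            + (27 - len(counts)) * mean)

def informativeness(text, n=3):
    """Q(best partition) - Q(worst partition) over the n offsets."""
    q = [abs_deviation_sum(text, k, n) for k in range(n)]
    return max(q) - min(q)

# A periodic ("obvious") text: every partition is a single repeated
# tuple, so every partition has the same quality and Q vanishes.
periodic = "012" * 66
print(informativeness(periodic))  # -> 0.0
```

Each partition of the periodic text is one tuple repeated 66 times, so all n quality values coincide and the informativeness is exactly zero, as the intuition above requires.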

If "partial interpretation" is defined as "acceptable solution of
a decipherment algorithm", it is now possible to answer this question:
what is the property of a text which maximizes its partial informativeness?
(by which is meant the difference between the quality of the best and
the worst acceptable solution). The solution to this problem must lie in
the computation of the distributions of the n-tuples of symbols for a
class of interpretations which maximize partial informativeness, or
in the computation of some statistical properties of those interpretations,
such as moments or redundancy. Those are the properties by which it
should be possible to ascertain to what degree a text possesses the
property of being informative.


