[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
VMs: VMS: Old Stuff

To: vms-list@xxxxxxxxxxx
Subject: VMs: VMS: Old Stuff
From: Jacques Guy <jguy@xxxxxxxxxxxxxxxx>
Date: Wed, 09 Jul 2003 15:58:13 +1000
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx
Since there are many new members, who certainly will never
wade through the megabytes of archives, here is an old post
which I found on a very dusty corner of my hard disk (do
disks have corners?). Oh, do set your mail reader to a Courier
font, otherwise it won't display properly.

This is one of Boris Viktorovich Sukhotin's decipherment 
algorithms, which I had adapted from a French translation
published in T.A. Informations in 1974. My comments are
in (*...*)


             PARTITIONING A TEXT INTO ITS MORPHEMES


(* The algorithm effectively partitions a text into its
morph components, not morphemes, but it will turn out to be far
more powerful than that: it can be expected to do a morphological
analysis of an unknown text. "Morph" should be read for "morpheme"
whenever the latter occurs in the text *)

We will assume that the text to be partitioned has no spaces
to separate words and we also assume that the spelling used is
phonemic.

(* The former assumption makes the problem more general and somewhat
more difficult. There is no need for the latter: the algorithm
will apply to strings of ideograms as well as to strings
of phonemes equally successfully *)

As in the case of the consonant/vowel algorithm, we start by
defining the set of acceptable solutions.


SET OF ACCEPTABLE SOLUTIONS

A partition of a text into disjoint subsets is an acceptable
solution.

A few definitions need to be introduced before going further.

A text is a mapping of an alphabet A (a set of letters) onto a
set of locations L. In other words, a text is a set of 2-tuples
of the type (a,p) where  a  is a letter and  p  its position in
the text.

A string Str(s,t) (where s and t are integer numbers and s>=0)
is a set of 2-tuples of the type {(a[i],p[j])} in which s<p[j]<=t.
A text T is a string Str(0,p), p being its length.

(* The notation is somewhat unclear; a[i] is presumably the ith letter of
the alphabet, p[j] the position in T of the jth letter of Str(s,t).
That definition of a string allows for disconnected strings (eg.
"dfton" is a string of the text "definition": {('d',1),('f',3),
('t',7),('o',9),('n',10)} ). Whether Suhotin actually intended it
to be so is unclear. The definition of a text as a string
Str(0,n) of length n is counterintuitive; Str(1,n) would seem
more appropriate. It does allow for a null text Str(0,0),
yet there is no point is partitioning a null text *)

A partition P of a text Str(0,n) is a set of strings {Str[i](s,t)}
such that to each string Str[i](s,t) for which t<>n one can associate
a string Str[i+1](t,v). Str(s,t) and Str(v,w) have a non-empty
intersection if s<v<t<w or if v<s<w<t. (Since strings are sets,
we can use the terminology of set theory and say that a text is
included in another or that two texts have a non-empty intersection.)

Suppose that some relation of similarity on the set of strings of
text T has been defined.

A morpheme m is the set which contains some string Str(s,t) of T
and all the all strings of T similar to Str(s,t).

Let M be the set of the morphemes of T. Given an arbitrary partition
P of T, some strings of m are also strings of P, and some are not.
We say that the strings of m which are also strings of P are true
occurrences of the morpheme m relatively to P, and those which are
not are false occurrences of m.

If the partition P does not contradict our intuition we can say that
those occurrences are true or false absolutely rather than relatively
to a partition.

For instance, "pint" is a true occurrence of the morpheme "pint"
in 1) below, but it is not in 2), and "ilk" is a false occurrence
of the morphem "ilk" in 1):

1) |A|pint|of|milk|
2) |Keep|in|touch|

A morpheme is true if there is at least one absolutely true
occurrence of it in T.

A definition of "acceptable solution" now emerges: an
acceptable solution consists of:

1) a partition of the text
2) a set of morphemes

Let us now turn to the problem of defining a similarity relation.

A possible definition is:

Two strings Str(s,t) and Str(v,w) are similar when one can
be obtained from the other by translation, that is, if
Str(s,t) = { (a[x],p[j]), (a[y],p[j+1]), ... } and
Str(v,w) = { (a[x],p[y+c]), (a[y],p[j+1+c), ... } in which
c is an integer constant.

A morpheme m is then conveniently represented by the string
Strm(0,n) of length n, similar to any string of m and consisting of
2-tuples (a[x],p[y]) in which 0<p[y]<=n. A morpheme m[i]
alphabetically contains another m[j] if the string Strm[i]
(* which is a set *) contains the string Strm[j].

Given the notion of similarity as introduced above, a partition
uniquely defines a set of morphemes and the set of acceptable
solutions is therefore the set of partitions.

Having defined the set of acceptable solutions, we still have to
find an adequate objective function, then an optimal solution.
But we are now confronted with an interesting phenomenon. Consider the
following examples:

T:   innumerabilis
P1:  |in|numer|abil|is|
P2:  |in|numerabil|is|
P3:  |in|numerabilis|

It is apparent that the three partition are all, in a way, "right";
P1 only seems finer than P2 and P2 finer than P3. Nor can this partition
of a longer text be considered wrong:

P: |innumerabilis|annorum|series|et|fuga|temporum|

This is all the more remarkable that the strings which compose P2,
P3, and P are linguistically meaningful objects: they are roots, words,
or more simply, combinations of morphemes which behave like
isolated morphemes. This is the reason why partition P, for instance,
is not merely an approximation of P1; it has its own significance
because the strings which it evidences are words. It is obvious
too that P is more meaningful than P4 below, even though
P4 is also composed of strings which are combinations of simple
morphemes:

P4: |innumer|abilisannor|umserie|set|fugatempor|um|

We obtain thus a model of morphological analysis consisting of a series
of coarser and coarser partitions, each one of which is right in its own
way, and can be considered to be the approximation of other, finer, ones.

We say that Pi is coarser than Pj if any string of Pj is included
in a string of Pi, but no string of Pj is included in a string of
Pi.

An acceptable solution, which is at the same time a morphological
analysis, is then a series of partitions P0, P1, ...
Pj, ....Pn, each one of which (P0 excepted) is finer than the
previous one.

Such a morphological analysis is conveniently represented by
a tree (similar to that of immediate constituent analysis), or
by bracketing, eg.:


.1)
.               .----------------.
.               |                |
.               |           .--------.
.               |           |        |
.               |       .------.     |
.               |       |      |     |
.             .--.   .-----. .----. .--.
.             |in|   |numer| |abil| |is|


2)            ((((in)))(((numer)(abil))((is))))


A morphological analysis is of course an operation more complex
than a partition. It corresponds, however, to a more simple definition
of a morpheme:

    A morpheme is a set of similar strings the intersection of
    which is empty.

It follows from this definition that any morphological analysis
includes the finest partition Pn, into isolated letters, since there
is no one-letter string which does not have a non-empty intersection
with another string. Likewise, it includes the coarsest partition P0,
which is the whole of the text.

This new definition of an acceptable solution does not entail a
particularly more complex recognition procedure.

Note, however, that the definition of similarity is clearly wanting.
It prevents the algorithm from coping with

1) Sandhi-type morphemic alternances, eg. silex/silicis
2) Infixation, eg. rumpo/ruptus
3) Internal flexion, eg. facio/feci
4) Stress shifts, eg. orno/ornare

This is nothing disastrous, however. The first point at least
(* sandhi-type morphemic alternances *) can be fixed with a finer
definition of similarity relation.

(* Probably not. It is difficult to conceive how a similarity
relation allowing for sandhi could be other than language-
specific.  But if we read "morph" instead of "morpheme", the
problem disappears: the algorithm partitions a text into its
morphs; another algorithm must be devised to group them into
morphemes *)

Before we turn to the objective function, we must generalize
some notions already mentioned and introduce new ones.

A string Str which is a member of a morpheme m is a true
occurrence of m if it is also a member of at least one
partition of the morphological analysis A (or in other
words this string is true for that particular analysis).
Consider now a partition of this string. This partition is
immediate if it consists of true occurrences the number of
which is minimum.


THE OBJECTIVE FUNCTION

Since the number of different possible morphological analyses of any
text of reasonable length is so large, the definition of morpheme
elaborated above is wanting. It lacks a criterion of quality.

A morpheme is set of similar strings which occur REPEATEDLY in
a text. Its members must then be, in some way, stable (this,
however, is only true of those strings which are true occurrences
of the morpheme).

Suppose a measure of stability defined. The value of any
given morphological analysis can then be estimated by the
sum of the stabilities of the morphemes identified in the
process. If the measure of stability is good, the higher the
sum, the better the analysis, with the best analysis
yielding a maximum sum.

Let us consider some possible measurements of stability:

1) The stability of a morpheme is equal to the absolute
   frequency of its true occurrences.
   A seemingly attractive definition because of its simplicity,
   but unsatisfactory for short strings of high-frequency letters,
   which may occur frequently by pure chance.

2) The difference between the expected relative frequency of the
   strings of a morpheme and the observed relative frequency of its
   true occurrences.
   (* No reason is given for not using this measurement. *)

3) A function based on the following properties of morphemes:
   a) the presence of a part of a true occurrence announces the
      the presence of the rest of the morpheme with a high
      degree of likelihood
   b) true occurrences of morphemes are likely to be found
      in very different environments.

  (* Point b) possibly inspired from Zelig Harris's  "From
   phoneme to morpheme" in Language, v.21 No.3, 1945 *)


Call the first property (a) internal stability, the second (b)
external stability.

Consider the internal stability of a string Str(0,6) = "passer"

Suppose that the immediate partition of this string is of the
form:

    (p)(a)(s)(s)(e)(r)

List all the possible partitions of this string into two
connected substrings:

P1 (p)|(a) (s) (s) (e) (r)
P2 (p) (a)|(s) (s) (e) (r)
P3 (p) (a) (s)|(s) (e) (r)
P4 (p) (a) (s) (s)|(e) (r)
P5 (p) (a) (s) (s) (e)|(r)

It is clear that wherever "passe" occurs in a Latin text,
"r" is likely to occur next, and that wherever "asser" is found,
"p" is likely to precede.

Represent the string "passer" onto which partition Pi has
been effected by Si,S'i in which S is its left part
and S'i its right part.

Let the degree of attraction of S'i by Si be measured by the
fraction f(Si,S'i)/f(Si) and that of Si by S'i by f(Si,S'i)/f(S'i),
f being here either the relative or the absolute frequency.

(* That is, the degree of attraction of the right part by the
left part is equal to the number of occurrences of left and
right parts together divided by the total number of occurrences
of the left part, with or without the right part *)

The highest degree of attraction is 1 and the lowest is 0.

Let the internal stability of the string be the mean attraction
of all its partitions and strings of length 1 (single letters)
have a stability of 0 by definition.

External stability can be estimated roughly on a binary scale:
complete stability and complete instability. Now, the absolute
frequency of a morpheme m represented by a string S which
alphabetically includes another, longer string S' itself
representing a morpheme m' cannot be greater than that of m'.
Hence, if m and m' occur with exactly the frequency in the text,
m' is, as it were, "immersed" in m. By definition, the external stability
of a morpheme is 0 if it is "immersed" in another, otherwise it is 1.

Let the stability of a morpheme be the product of
its internal and external stabilities, and the stability
of a morphological analysis the sum of the stabilities
of its morphemes.

(* It follows from those definitions that the stability of
the finest (single letters) is zero. This is intuitively
agreeable, for this partition is trivial *)


RECOGNITION PROCEDURE

The optimal morphological analysis of the text and the list of its
true morphemes are obtained simultaneously.

A list of all the morphemes of a text is a gross approximation of
the list of its true morphemes (it is much longer and contains
many morphemes which are mostly false occurrences). Call this
first approximation L0.

A better approximation, L1, is obtained by eliminating morphemes
of stability 0.

L1 is much shorter L0. Unfortunately at this stage L1 lacks true
morphemes of zero stability but re-entering them later is quite easy.
L1 is convenient, for it contains all the morphemes the frequency
of which is necessary to calculate the stability of any morpheme.

The algorithm consists of five steps.


STEP 1

  Build a list L2 of all the morphemes of absolute frequency
  greater than 1, with each entry in the list consisting
  of the string representing the morpheme, and of its absolute
  frequency.
  Since any morpheme with an absolute frequency of one has
  a stability of zero L2 is shorter than L0 and longer than
  L1 as far as morphemes with a stability of 0 and a frequency
  greater than 1 (* eg. one-letter words *) are concerned.

  L2 can be obtained by the following procedure:

  a) List the alphabet of the text, calculate the absolute
     frequency of each letter. Remove entry with a
     frequency of one. This list is the first fragment,
     or fragment of rank 1, of L2. Represent it by L2(1).

  b) Given a fragment of rank n, the fragment of rank n+1 is obtained
     as follows: build the strings of the form Str(0,n)+Str(n,n+1)
     in which Str(0,n) is a string of L2(n) and Str(n,n+1) a string
     of L2(1); n, transposed, moves to the right; calculate their
     frequencies and eliminate the strings with frequencies not greater
     than 1.

     The procedure terminates when we obtain L2(n) for which L2(n+1)
     is empty. The list L2 is of the sum of these fragments.

     (* In far simpler terms: given a text of length n, list
     all the substrings of it from length 1 to n which occur
     more than once *)

STEP 2

  Calculate the stability of each string of L2.

  (But remember that at this stage only strings of length 1 can
  be considered to constitute true occurrences of morphemes.)

STEP 3

  This  step is based on the assumption that all the
  occurrences of the morpheme with the highest stability
  (greater than zero) are true. Granted this assumption,
  no string which has a non-empty intersection with an
  occurrence of this maximum stability morpheme can be a
  true occurrence of a morpheme.

  Therefore, select the morpheme of L2 with the highest
  stability, and bracket its occurrences in the text
  (* thus carrying out morpheme recognition and
  morphological analysis concurrently *)

  The frequency of every morpheme of L2 with a
  non-empty intersection with N occurrences of the
  selected morpheme is reduced by N. Some false occurrences
  are thus discarded. When during this process the frequency
  of any given morpheme falls below 2 it is eliminated from
  the list.

  Since the status (true or false occurrence) and
  the frequency of strings varies during the procedure, the
  stability of the rest of the remaining morphemes is then
  calculated anew.

  The process is repeated until only morphemes with zero
  stabilities are left.

STEP 4
  We now have a partition of the text expressed by bracketing.
  This partition, however, does not yet represent a morphological
  analysis.

  But first a few definitions.

  A pair of brackets which contains no other brackets is of
  rank 1. A pair of bracket which contains other brackets,
  the highest rank of which is i-1, is therefore itself of
  rank i. A segment of text which is not between
  brackets of rank 1, does not contain brackets, and is
  between brackets of any rank not necessarily forming a
  pair, is considered to be between brackets of rank 0
  (not represented in the text). A segment between two
  brackets of rank i is a segment of rank i.
  A pair of brackets is AUTONOMOUS for rank r if the lowest rank
  of the brackets by which it is contained is r.

  Find the maximum bracket rank of the text. Call it r. Consider
  the text to be between (invisible) brackets of rank r+1.
  Enclose all brackets pairs autonomous for rank r+1 with
  new bracket pairs until the rank of the enclosing brackets
  is equal to r.

  Repeat for all segments of rank r down to 2.

  The morphological analysis -- represented by bracketing
  segments of the text -- is now complete.

STEP 5
  The segments which appeared in brackets for the first time
  at step 4 are true occurrences of zero-stability morphemes
  (eg. morphemes having only one string for member)
  and are added to L2.






______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list
Follow-Ups:
- Re: VMs: VMS: Old Stuff
  - From: steve ekwall
Prev by Date: RE: VMs: Re: word length counts
Next by Date: VMs: VMS: more past stuff: "On som properties of informative texts"
Previous by thread: VMs: An omission
Next by thread: Re: VMs: VMS: Old Stuff
Index(es):
- Date
- Thread