Reinterpreting the LSC (long)
Mark, I haven't had the time to read your papers carefully enough,
but the idea of the LSC seems quite interesting. Here are some of my
comments.
Why should the LSC work?
In a very broad sense, the LSC and the nth-order character/word
entropies are trying to measure the same thing, namely the
correlation between letters that are a fixed distance apart.
People have observed before that correlation between samples n steps
apart tends to be higher for "meaningful" signals than for "random"
ones, even for large n. The phenomenon has been observed in music,
images, DNA sequences, etc. This knowledge has been useful for, among
other things, designing good compression and approximation methods for
such signals. Some of the buzzwords one meets in that context are
"fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I
believe that Gabriel has written papers on fractals in the context of
medical imaging. And a student of mine just finished her thesis on
reassembling pottery fragments by matching their outlines, which turn
out to be "fractal" too.)
As I try to show below, one can understand the LSC as decomposing
the text into various frequency bands, and measuring the `power'
contained in each band. If we do that to a random signal, we will
find that each component frequency has roughly constant expected
power; i.e. the power spectrum is flat, like that of ideal white
light (hence the nickname `white noise'.) On the other hand, a
`meaningful' signal (like music or speech) will be `lumpier' than a
random one, at all scales; so its power spectrum will show an excess
power at lower frequencies. It is claimed that, in such signals, the
power tends to be inversely proportional to the frequency; hence the
moniker `1/f noise'.
If we lump the spectrum components into frequency bands, we will
find that the total power contained in the band of frequencies
between f and 2f will be proportional to f for a random signal, but
roughly constant for a `meaningful' signal whose spectrum indeed
follows the 1/f profile.
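To make that contrast concrete, here is a small numpy sketch (the signal length and the 1/f synthesis recipe are my own choices, purely for illustration): it lumps the power spectrum of a white-noise signal and of a synthetic 1/f signal into dyadic bands [f, 2f) and prints the ratio of each band's total power to that of the previous band.

    import numpy as np

    rng = np.random.default_rng(0)
    L = 2**16

    # White noise: flat power spectrum.
    white = rng.standard_normal(L)

    # Synthetic 1/f ("pink") noise: shape a white spectrum by 1/sqrt(f),
    # so that the power goes as 1/f.  (One standard recipe among several.)
    spec = np.fft.rfft(rng.standard_normal(L))
    f = np.fft.rfftfreq(L)
    f[0] = 1.0                            # avoid dividing the DC term by zero
    pink = np.fft.irfft(spec / np.sqrt(f), n=L)

    def band_ratios(y):
        """Ratio of total power in band [2f,4f) to that in [f,2f)."""
        p = np.abs(np.fft.rfft(y))**2
        b, lo = [], 1
        while 2*lo <= len(p):
            b.append(p[lo:2*lo].sum())
            lo *= 2
        return [round(b[i+1]/b[i], 2) for i in range(len(b) - 1)]

    print("white:", band_ratios(white))   # ratios near 2: power ~ bandwidth
    print("1/f:  ", band_ratios(pink))    # ratios near 1: constant per band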
Is the LSC better than nth-order entropy?
In theory, the nth-order entropies are more powerful indicators of
structure. Roughly speaking, *any* regular structure in the text
will show up in some nth-order entropy; whereas I suspect that one can
construct signals that have strong structure (hence low entropy)
but the same LSC as a purely random text.
However, the formula for nth-order entropy requires one to estimate
z**n probabilities, where z is the size of the alphabet. To do that
reliably, one needs a corpus whose length is many times z**n. So the
entropies are not very meaningful for n beyond 3 or so.
The nth-order LSC seems to be numerically more stable, because it maps
blocks of n consecutive letters into a single `super-letter' which
is actually a vector of z integers; and compares these super-letters
as vectors (with a difference-squared metric) rather than as symbols
(with a simple 0-1 metric). I haven't done the math --- perhaps you
have --- but it seems that computing the n-th order LSC to a fixed
accuracy requires a corpus whose length L is proportional to z*n (or
perhaps z*n**2?) instead of z**n.
Moreover, one kind of structure that the LSC *can* detect is any
medium- and long-range variation in word usage frequency along the
text. (In fact, the LSC seems to have been designed specifically for
that purpose.) As observed above, such variations are present in
most natural languages, but absent in random texts, even those
generated by kth-order monkeys.
Specifically, if we take the output of a k-th order `letter
monkey' and break it into chunks whose length n >> k, we will find
that the number of times a given letter occurs in each chunk is
fairly constant (except for sampling error) among all chunks. For
kth-order `word monkeys' we should have the same result as long as
n >> k*w, where w is the average word length. On the other hand, a
natural-language text will show variations in letter frequencies,
which are due to changes of topic and hence vocabulary changes, that
extend for whole paragraphs or chapters.
Thus, although the LSC may not be powerful enough to detect the
underlying structure in non-trivial ciphers, it seems well suited at
distinguishing natural language from monkey-style random text.
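A small simulation may make the contrast vivid. The parameters below (alphabet size, chunk size) and the `drifting' source standing in for topic changes are my own inventions, not anything from the papers; the point is only that a monkey's per-chunk letter counts vary by sampling error alone, while a source whose letter probabilities wander shows much larger swings.

    import numpy as np

    rng = np.random.default_rng(1)
    z, L, n = 26, 50_000, 2_500           # alphabet size, text length, chunk size

    # 0th-order letter monkey: one fixed probability vector for the whole text.
    p = rng.dirichlet(np.ones(z))
    monkey = rng.choice(z, size=L, p=p)

    # Crude stand-in for topic drift: probabilities that wander slowly along
    # the text (an illustration only, not a model of any real language).
    walk = np.abs(p + np.cumsum(rng.standard_normal((L, z)) * 0.001, axis=0))
    walk /= walk.sum(axis=1, keepdims=True)
    cdf = np.cumsum(walk, axis=1)
    drifting = np.minimum((rng.random(L)[:, None] >= cdf).sum(axis=1), z - 1)

    def per_chunk_std(text, letter=0):
        """Standard deviation of the count of `letter` across chunks of size n."""
        counts = [np.sum(chunk == letter) for chunk in np.split(text, L // n)]
        return float(np.std(counts))

    print("monkey  :", per_chunk_std(monkey))    # ~ binomial sampling error only
    print("drifting:", per_chunk_std(drifting))  # noticeably larger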
Another view of LSC
Someone -- I think it was Goethe -- observed that mathematicians
are like Frenchmen: whatever you say to them, they immediately translate
into their language, and it then becomes something else entirely.
Although I am not a mathematician, I cannot resist the chance
to translate your LSC definition into my own brand of `French.'
Here it goes:
STEP 1 of the LSC computation consists in replacing each letter x by
a vector of z zeros and ones, where the r-th component of the
vector is 1 if and only if x is the r-th letter of the alphabet.
Thus the original text is replaced by z binary strings of the same
length. Steps 2 and 3 below are applied to each
of these z binary strings, separately.
STEP 2 takes a binary string y(i), i = 0..L-1 and produces a string
of integers Y^n_j, j = 0..L/n-2, by the formula
Y^n_j = \sum_{i=0}^{L-1} h^n_j(i) y(i), (1)
where h^n_j() is the `kernel function'
{ +1 for jn \leq i < (j+1)n,
h^n_j(i) = { -1 for (j+1)n \leq i < (j+2)n, (0)
{ 0 for all other values of i.
In words, h^n_j() is a train of n `+1's followed
by n `-1's, all surrounded by a sea of zeros.
(Here "^n" is merely a superscript index, not a power.)
STEP 3 takes the numbers Y^n_j and computes the sums
S^n = \sum_{j=0}^{L/n-2} Y^n_j**2 (2)
STEP 4 adds together the sums S^n_x obtained for
different letters x, into a single sum S^n_tot.
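For concreteness, here is how I would transcribe steps 1--4 into Python. The function and variable names are mine, and this is only a sketch of my reading of the definition, not code from the papers.

    import numpy as np

    def lsc(text, n):
        """S^n_tot for chunk size n: steps 1-4 above, with the kernels
        of formula (0), i.e. differences of adjacent chunk counts."""
        alphabet = sorted(set(text))
        L = len(text)
        m = L // n                          # number of complete chunks
        total = 0.0
        for x in alphabet:
            # STEP 1: indicator signal y(i) = 1 iff text[i] == x.
            y = np.fromiter((c == x for c in text), dtype=float, count=L)
            counts = y[:m*n].reshape(m, n).sum(axis=1)   # count of x per chunk
            # STEP 2: Y^n_j = (count in chunk j) - (count in chunk j+1).
            Y = counts[:-1] - counts[1:]
            # STEPS 3 and 4: sum of squares, accumulated over the alphabet.
            total += float(np.sum(Y**2))
        return total

    # text = open("genesis.txt").read()     # hypothetical input file
    # print([lsc(text, 2**k) for k in range(1, 12)])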
LSC as multiscale signal analysis
The point of using this convoluted formalism is that formula (1) can
be interpreted as a `scalar product' between the `signal' y() and
the `kernel' h^n_j(). Except for a constant factor, the coefficient
Y^n_j is therefore a measure of `how much' of h^n_j() is contained
in the signal y().
This interpretation of formula (1) is not restricted to the
particular kernel functions h^n_j of definition (0), but can be used
with any set of kernels h_r(); in that case, we would write
Y_r = \sum_{i=0}^{L-1} h_r(i) y(i) (1')
This interpretation becomes more interesting when the kernel
functions h_r() are orthonormal, i.e. \sum_i h_r(i) h_s(i) = 0 for r
different from s, and \sum_i h_r(i)**2 = 1. Then the signal
y~(i) = \sum_r Y_r h_r(i)
is the combination of kernel functions that best approximates the
signal y(). In particular, if every signal y() is a linear
combination of kernel functions (i.e. the kernels form a basis of
signal space), the numbers Y_r are the coefficients of that
combination.
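Here is a quick numerical check of that statement (any orthonormal basis will do for the check; I just take the orthogonal factor of a random matrix):

    import numpy as np

    rng = np.random.default_rng(2)
    L = 64
    H, _ = np.linalg.qr(rng.standard_normal((L, L)))   # Q has orthonormal columns
    H = H.T                                            # rows of H are the kernels h_r

    y = rng.standard_normal(L)                         # an arbitrary "signal"
    Y = H @ y                                          # (1'): Y_r = \sum_i h_r(i) y(i)
    y_tilde = H.T @ Y                                  # y~(i) = \sum_r Y_r h_r(i)

    print(np.allclose(y_tilde, y))                     # True: exact reconstruction
    print(np.allclose(np.sum(Y**2), np.sum(y**2)))     # True: power is preserved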
The decomposition of a signal y() into a combination of certain
kernel functions is a basic tool of signal processing. For instance,
in Fourier theory the kernels h_r are sinusoids and co-sinusoids on
the parameter i with various frequencies, and the numbers Y_r are
the corresponding Fourier coefficients.
The Fourier kernels have many nice properties, but they extend over
the whole i range; thus changing a single sample y(i) will change
all the coefficients Y_r. For that reason, in some applications
people prefer to use kernels that have a bounded support --- i.e.
they are zero, or practically zero, outside of a small range of the
i parameter. A set of such kernels is called a {\em spline basis}.
However, if Fourier kernels suffer for being too `global', spline
kernels suffer for being too `local': in order to approximate a
broad hump y() one needs to add many narrow bumps. {\em Wavelets}
were invented to fill the gap between these two extremes.
Note that all the functions h^n_j() defined by formula (0) are just
like the function h^1_0(), only stretched by a factor of n and shifted
by n*j. This scale-and-shift property is basically the definition of a
`wavelet family'. The parameter n is called the `scale' of the
wavelet, and the index j could be called its `phase'.
Wavelet theory is based on the observation that, if one picks a `mother
wavelet' h^1_0() with the right shape, then the following wavelets are
mutually orthogonal and span the space of all signals:
h^1_0, h^1_1, h^1_2, h^1_3, h^1_4, h^1_5, ...,
h^2_0, h^2_1, h^2_2, h^2_3, h^2_4, h^2_5, ...,
h^4_0, h^4_1, h^4_2, h^4_3, h^4_4, h^4_5, ..., (3)
h^8_0, h^8_1, h^8_2, h^8_3, h^8_4, h^8_5, ...,
Note that the scale n goes up in geometric progression, while the
indices j are consecutive. The wavelets with intermediate scales --
such as h^3_0 -- are redundant, in the sense that the corresponding
coefficients Y^n_j can be computed from the coefficients of the
wavelets listed above. Thus, for a signal of length L, we need only
L/(w*n) wavelets of scale n, where w is the actual shift between
h^1_0 and h^1_1. Therefore, the number of terms in each `layer' goes
down geometrically, and there are only log(L) layers.
The decomposition of a signal into a wavelet basis (3) is still
fairly `local', in the sense that a change in one sample y(i)
affects only O(log(L)) coefficients Y^n_j, corresponding to wavelets
h^n_j() whose support includes the point i. On the other hand, a
broad bump in the signal y() usually gets represented by a couple of
large coefficients Y^n_j, corresponding to wavelets of roughly the
same width and position as the bump, plus smaller `adjustment'
coefficients with smaller scales.
If we decompose a signal y() into non-redundant wavelets, and then
take only the components with a fixed scale n, we are basically
looking at the part of the signal that consists of details of size
~w*n. This is almost the same as decomposing y() into sinusoids of
frequencies 1,2,3,..., and then adding only the components with
frequencies between roughly L/(2wn) and L/(wn).
So a byproduct of wavelet analysis is a decomposition of the signal
y() into log(L) parts or `bands,' one for each layer of the basis.
If we add these components from the bottom up, after each stage we
obtain a picture of y() that is twice as sharp as the previous one. In
that case, the quantity S^n of formula (2) is the `strength' (or
`power' in the usual jargon) of band n of the signal y().
The kernels of definition (0), which are implied by the original LSC
definition, are NOT orthogonal, but only because they are packed
too tightly. If we redefine them as
{ +1 for 2jn \leq i < (2j+1)n,
h^n_j(i) = { -1 for (2j+1)n \leq i < (2j+2)n, (0')
{ 0 for all other values of i.
(with successive wavelets spaced by 2n rather than n), then all
members of family (3) become pairwise orthogonal, and (except for
scale factors) we get the so-called `Haar wavelet basis'.
The bottom-most layer of the Haar basis has only one half of a
wavelet, whose full width is 2*L; the corresponding coefficient is
proportional to the average of all samples y(i). The next layer has
a single wavelet with period L; its coefficient gives the difference
between the average value of y() in the half-range [0..L/2-1] and
the average value in [L/2..L-1]. The next layer has two wavelets;
their coefficients compare the average value of y() between the
first two quarters of the range [0..L-1], and between the last two
quarters. And so on.
When we add all layers of the Haar decomposition with scales >= n,
we get a staircase approximation to the signal y(), with steps of
width n. Each wavelet in the next layer will split one of these
steps in two steps of width n/2, raising one side and lowering the
other by the same amount.
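The whole construction fits in a few lines of code. The following sketch (my own, with an arbitrary test signal) computes the orthonormal Haar layers by repeated pairwise averaging and differencing, and then checks two of the claims above: the total power equals the sum of the squared coefficients, and dropping the k finest layers leaves a staircase whose steps are block averages of width 2**k.

    import numpy as np

    def haar_analysis(y):
        """Orthonormal Haar transform of a signal whose length is a power
        of two.  Returns the detail layers (finest first) and the overall
        average coefficient."""
        y = np.asarray(y, dtype=float)
        layers = []
        while len(y) > 1:
            even, odd = y[0::2], y[1::2]
            layers.append((even - odd) / np.sqrt(2))   # details at this scale
            y = (even + odd) / np.sqrt(2)              # running averages
        return layers, y[0]

    def haar_synthesis(layers, avg):
        """Inverse of haar_analysis."""
        y = np.array([avg])
        for d in reversed(layers):
            even, odd = (y + d) / np.sqrt(2), (y - d) / np.sqrt(2)
            out = np.empty(2 * len(y))
            out[0::2], out[1::2] = even, odd
            y = out
        return y

    rng = np.random.default_rng(3)
    sig = np.cumsum(rng.standard_normal(256))          # a "lumpy" test signal
    layers, avg = haar_analysis(sig)

    # Power is preserved (the basis is orthonormal):
    print(np.isclose(sum(np.sum(d**2) for d in layers) + avg**2, np.sum(sig**2)))

    # Dropping the k finest layers gives a staircase of step width 2**k,
    # each step sitting at the average of the signal over that block:
    k = 4
    stair = haar_synthesis([np.zeros_like(d) for d in layers[:k]] + layers[k:], avg)
    print(np.allclose(stair[:2**k], sig[:2**k].mean()))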
Variations on the LSC
This reinterpretation of the LSC exposes several choices that
were implicit in the original definition. We can get many
other `LSC-like' indicators by picking different alternatives for
those choices.
In STEP 1 the given text gets converted into one or more numeric
`signals' y(i). This step is perhaps the most arbitrary part of the
definition. Almost any mapping here will do, as long as it is
local --- i.e. changing one letter of the text will affect only
one signal sample y(i) (or at most a few adjacent samples).
In the original definition, each letter is mapped to 1 or 0
depending on whether it is equal to or different from a given
letter x. This simple mapping has the advantage that it makes the
LSC invariant under simple letter substitutions. It may also be
the `most sensitive' mapping under the appropriate definition of
sensitivity. But we could consider other mappings, especially
if we already know something about the alphabet. For instance,
we could map consonants to +1, vowels to -1; or map
each letter to a weight that depends on its frequency over the
whole text.
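For instance, the consonant/vowel mapping could be as simple as this sketch (assuming lowercase Latin text; how non-letters are handled is an arbitrary choice of mine):

    def consonant_vowel_signal(text):
        """Map consonants to +1 and vowels to -1, dropping non-letters."""
        vowels = set("aeiou")
        return [(-1 if c in vowels else +1) for c in text.lower() if c.isalpha()]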
Note that it is fairly immaterial how we treat spaces,
punctuation, case, diacritics, digraphs, etc. We can treat those
features as letters, delete them, lump them with adjacent letters,
whatever. These choices will normally preserve the long-range
correlations (although they may affect the exact value of those
correlations).
In STEP 2 we analyze the signal y() against some repertoire
of kernel functions h^n_j, and obtain the corresponding
coefficients Y^n_j.
One conclusion we get from the `multiscale analysis'
interpretation is that it is pointless to compute the LSC for all
scales n between 1 and L. It is sufficient to consider only those
scales n which are powers of two. This observation allows us to
cut down the cost of computing the complete LSC plot from O(L**2)
to O(L) --- a 10^5 speedup for typical texts!
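The O(L) bound is in fact achievable, because the chunk counts at scale 2n are just pairwise sums of the counts at scale n. Here is a sketch, a variant of the lsc() function given earlier, with the same caveats:

    import numpy as np

    def lsc_dyadic(text):
        """S^n_tot at all dyadic scales n = 1, 2, 4, ... in O(L) work per
        letter: chunk counts at scale 2n are pairwise sums of those at
        scale n.  (Chunks that do not fill a whole pair are dropped.)"""
        alphabet = sorted(set(text))
        L = len(text)
        result = {}
        for x in alphabet:
            counts = np.fromiter((c == x for c in text), dtype=float, count=L)
            n = 1
            while len(counts) >= 2:
                Y = counts[:-1] - counts[1:]           # formula (1), kernels (0)
                result[n] = result.get(n, 0.0) + float(np.sum(Y**2))
                m = len(counts) // 2
                counts = counts[:2*m].reshape(m, 2).sum(axis=1)
                n *= 2
        return result                                  # {scale: S^n_tot}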
The multiscale analysis interpretation also suggests that we could
use other kernel functions in lieu of formula (0). From that
viewpoint, we can see that the shape of the kernels is not
important, as long as the application of formula (1) has the
effect of decomposing the signal y() into frequency bands related
to the scale n. It is not even necessary for the kernels to have
compact support. We could use for instance the Fourier kernel
(sinusoids), numbered so that the kernels with a given scale n
have frequencies between L/(2n) and L/n.
As observed above, the kernels implied by the original LSC
definition (formula (0)) are redundant when all scales are taken
together, and are not orthogonal. However, they are not redundant
if we consider a single scale n. In fact, the kernels { h^n_j,
j=0..L/n-2 } are linearly independent, and just one degree of
freedom short of being a complete basis; so they do provide good
coverage.
Another problem with the original kernels is that they are not
orthogonal. This would be an inconvenience for other applications,
but it does not seem to be a problem if the only goal is to
distinguish meaningful signals from random noise.
A more serious defect is that the kernels are square-shaped with
sharp discontinuities. Because of such discontinuities, each
`layer' of the implied signal decomposition will cover a rather
broad band of frequencies, and there will be significant overlap
between bands. Unfortunately, I don't know how to explain this
point in a few words.
Anyway, the conclusion is that the use of discontinuous kernels may
result in `blurred' LSC-versus-n plots, with spurious peaks and
valleys known as `Gibbs effects' or `ringing.' It may be worth
trying smoother kernels, such as gaussians and their derivatives,
or gaussian-sinusoid products.
In fact, you should consider using the smoothest possible kernels,
namely the Fourier basis. In other words, instead of the
LSC as defined, you could compute the power spectra of the
signals y().
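Concretely, a Fourier version would look like the following (my own sketch, with hypothetical names): for each letter, take the power spectrum of its indicator signal, lump it into dyadic frequency bands, and add the bands over the alphabet.

    import numpy as np

    def spectral_lsc(text):
        """Fourier stand-in for the LSC: per-letter power spectra of the
        indicator signals, lumped into dyadic frequency bands and summed
        over the alphabet.  Band b covers FFT bins [2**b, 2**(b+1))."""
        alphabet = sorted(set(text))
        L = len(text)
        nbands = int(np.log2(L // 2))
        total = np.zeros(nbands)
        for x in alphabet:
            y = np.fromiter((c == x for c in text), dtype=float, count=L)
            y -= y.mean()                              # drop the DC term
            p = np.abs(np.fft.rfft(y))**2
            for b in range(nbands):
                total[b] += p[2**b : 2**(b+1)].sum()
        return total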
STEP 3 computes the sum S^n of the squares of the coefficients
Y^n_j, for all j. I can't think of any useful alternative to this
step. However, under the `signal analysis' interpretation, it is
desirable to choose the kernel functions so that they are
orthogonal and have unit norm (i.e. \sum_i h^n_j(i)**2 = 1). These
conditions ensure that the power of the original signal y() is
equal to the sum of all S^n. That will make the value of S^n
more meaningful, and easier to estimate for random texts.
Other comments
In conclusion, my understanding of the Perakh-McKay papers is that
computing the LSC is an indirect way of computing the power spectrum
of the text. The reason why the LSC distinguishes meaningful texts
from monkey gibberish is that the former have variations in letter
frequencies at all scales, and hence a 1/f-like power spectrum;
whereas the latter have uniform letter frequencies, at least over
scales of a dozen letters, and therefore have a flat power spectrum.
Looking at the LSC in the context of multiscale analysis suggests
many possible improvements, such as using scales in geometric
progression, and kernels which are smoother, orthogonal, and of
unit norm. Even if these changes do not make the LSC more sensitive,
they should make the results easier to evaluate.
In retrospect, it is not surprising that the LSC can distinguish the
original Genesis from a line-permuted version: the spectra should be
fairly similar at high frequencies (with periods shorter than one
line), but at low frequencies the second text should have an
essentially flat spectrum, like that of a random signal. The same
can be said about monkey-generated texts.
On the other hand, I don't expect the LSC to be more effective than
simple letter/digraph frequency analysis when it comes to
identifying the language of a text. The most significant influence
in the LSC is the letter frequency histogram --- which is sensitive
to topic (e.g. "-ed" is common when talking about the past) and to
spelling rules (e.g. whether one writes "ue" or "ü"). The shape of
the LSC (or Fourier) spectrum at high frequencies (small n) must be
determined mainly by these factors. The shape of the spectrum at
lower frequencies (higher n) should be determined chiefly by topic
and style.
I have only skimmed through the sections about estimating the LSC
for random texts. I didn't see anything wrong, but my impression is
that those sections are overdoing it. For instance, it should not
matter whether the letters are drawn with or without replacement,
or whether the last chunk is complete or not: surely the effect of
these details on the LSC must be less than the sampling error.
(Compare, for instance, the LSC of the first half of Genesis with
that of the second half.) Thus, I believe that these sections could
be much shorter if these details were dismissed right at the
beginning.
All the best,
--stolfi