
Reinterpreting the LSC (long)



Mark, I haven't had the time to read your papers carefully enough,
but the idea of the LSC seems quite interesting. Here are some of my
comments.

Why should the LSC work?

  In a very broad sense, the LSC and the nth-order character/word
  entropies are trying to measure the same thing, namely the
  correlation between letters that are a fixed distance apart.

  People have observed before that correlation between samples n steps
  apart tends to be higher for "meaningful" signals than for "random"
  ones, even for large n. The phenomenon has been observed in music,
  images, DNA sequences, etc. This knowledge has been useful for, among
  other things, designing good compression and approximation methods for
  such signals. Some of the buzzwords one meets in that context are
  "fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I
  believe that Gabriel has written papers on fractals in the context of
  medical imaging. And a student of mine just finished her thesis on
  reassembling pottery fragments by matching their outlines, which turn
  out to be "fractal" too.) 
  
  As I try to show below, one can understand the LSC as decomposing
  the text into various frequency bands, and measuring the `power'
  contained in each band. If we do that to a random signal, we will
  find that each component frequency has roughly constant expected
  power; i.e. the power spectrum is flat, like that of ideal white
  light (hence the nickname `white noise'.)  On the other hand, a
  `meaningful' signal (like music or speech) will be `lumpier' than a
  random one, at all scales; so its power spectrum will show an excess
  power at lower frequencies. It is claimed that, in such signals, the
  power tends to be inversely proportional to the frequency; hence the
  moniker `1/f noise'.  
  
  If we lump the spectrum components into frequency bands, we will
  find that the total power contained in the band of frequencies
  between f and 2f will be proportional to f for a random signal, but
  roughly constant for a `meaningful' signal whose spectrum indeed
  follows the 1/f profile.
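  
  To make this concrete (a textbook calculation, assuming idealized
  continuous spectra): for white noise with power density P(f) = c,
  the power in the octave [f, 2f] is

      \int_f^{2f} c df  =  c f,

  which grows linearly with f; whereas for an ideal `1/f' signal with
  P(f) = c/f it is

      \int_f^{2f} (c/f) df  =  c ln 2,

  the same in every octave.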
  
Is the LSC better than nth-order entropy?

  In theory, the nth-order entropies are more powerful indicators of
  structure. Roughly speaking, *any* regular structure in the text
  will show up in some nth-order entropy; whereas I suspect that one can
  construct signals that have strong structure (hence low entropy)
  but the same LSC as a purely random text.

  However, the formula for nth-order entropy requires one to estimate
  z**n probabilities, where z is the size of the alphabet. To do that
  reliably, one needs a corpus whose length is many times z**n. So the
  entropies are not very meaningful for n beyond 3 or so.
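  
  Just to make the bookkeeping explicit, here is a little Python
  sketch (my own code, not taken from your papers) that estimates the
  nth-order entropy by brute-force n-gram counting. The counts table
  can hold up to z**n distinct entries, which is why the estimate
  degrades so quickly as n grows:

      # Estimate the nth-order (n-gram) entropy of a text, in bits per
      # n-gram.  The table may hold up to z**n entries, so the text
      # must be many times z**n letters long for the probabilities to
      # be estimated reliably.
      from collections import Counter
      from math import log2

      def ngram_entropy(text, n):
          counts = Counter(text[i:i+n] for i in range(len(text) - n + 1))
          total = sum(counts.values())
          return -sum((c/total) * log2(c/total) for c in counts.values())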

  The nth-order LSC seems to be numerically more stable, because it maps
  blocks of n consecutive letters into a single `super-letter' which
  is actually a vector of z integers; and compares these super-letters
  as vectors (with difference-squared metric) rather than symbols
  (with simple 0-1 metric). I haven't done the math --- perhaps you
  have --- but it seems that computing the n-th order LSC to a fixed
  accuracy requires a corpus whose length L is proportional to z*n (or
  perhaps z*n**2?) instead of z**n.
  
  Moreover, one kind of structure that the LSC *can* detect is any
  medium- and long-range variation in word usage frequency along the
  text. (In fact, the LSC seems to have been designed specifically for
  that purpose.) As observed above, such variations are present in
  most natural languages, but absent in random texts, even those
  generated by kth-order monkeys.
  
  Specifically, if we take the output of a k-th order `letter
  monkey' and break it into chunks whose length n >> k, we will find
  that the number of times a given letter occurs in each chunk is
  fairly constant (except for sampling error) among all chunks. For
  kth-order `word monkeys' we should have the same result as long as 
  n >> k*w, where w is the average word length. On the other hand, a
  natural-language text will show variations in letter frequencies,
  which are due to changes of topic and hence vocabulary changes, that
  extend for whole paragraphs or chapters.
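  
  For concreteness, here is a quick Python sketch of such a kth-order
  `letter monkey' (my own code, just to fix the idea; it assumes
  k >= 1): each new letter is drawn according to how often it followed
  the previous k letters in a training text. Chopping its output into
  chunks of length n >> k and counting a given letter in each chunk
  should show nothing but sampling noise.

      # A kth-order `letter monkey': each new letter is drawn according
      # to the frequency with which it followed the preceding k letters
      # in the training text.  (Assumes k >= 1.)
      import random
      from collections import defaultdict

      def letter_monkey(training_text, k, length, seed=0):
          rng = random.Random(seed)
          table = defaultdict(list)
          for i in range(len(training_text) - k):
              table[training_text[i:i+k]].append(training_text[i+k])
          state = training_text[:k]
          out = list(state)
          while len(out) < length:
              followers = table.get(state) or list(training_text)
              nxt = rng.choice(followers)
              out.append(nxt)
              state = (state + nxt)[-k:]
          return "".join(out)

      def letter_counts_per_chunk(text, letter, n):
          # Count occurrences of `letter' in consecutive chunks of length n.
          return [text[i:i+n].count(letter) for i in range(0, len(text) - n + 1, n)]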
  
  Thus, although the LSC may not be powerful enough to detect the
  underlying structure in non-trivial ciphers, it seems well suited to
  distinguishing natural language from monkey-style random text.
  
Another view of LSC

  Someone -- I think it was Goethe -- observed that mathematicians
  are like Frenchmen: whatever you say to them, they immediately translate
  into their language, and it then becomes something else entirely.
  Although I am not a mathematician, I cannot resist the chance
  to translate your LSC definition into my own brand of `French.'
  Here it goes:
  
  STEP 1 of the LSC computation consists in replacing each letter Y by
    a vector of z zeros and ones, where the r-th component of the
    vector is 1 if and only if Y is the r-th letter of the alphabet.
    Thus the original text is replaced by z binary strings of the same
    length.  Steps 2 and 3 below are applied to each
    of these z binary strings, separately.

  STEP 2 takes a binary string y(i), i = 0..L-1 and produces a string
    of integers Y^n_j, j = 0..L/n-2, by the formula
    
      Y^n_j = \sum_{i=0}^{L-1} h^n_j(i) y(i),               (1)
      
    where h^n_j() is the `kernel function' 

                 { +1 for jn \leq i < (j+1)n,
      h^n_j(i) = { -1 for (j+1)n \leq i < (j+2)n,           (0)
                 { 0 for all other values of i.
    
    In words,  h^n_j() is a train of n `+1's followed
    by n `-1's, all surrounded by a sea of zeros.
    (Here "^n" is merely a superscript index, not a power.)

  STEP 3 takes the numbers Y^n_j and computes the sums

      S^n = \sum_{j=0}^{L/n-2} Y^n_j**2                     (2)
      
  STEP 4 adds together the sums S^n_x obtained for 
    different letters x, into a single sum S^n_tot.
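  
  If I have translated the definition correctly, those four steps fit
  in a few lines of Python (a sketch only, untested against your
  published values; it uses the kernels of formula (0), so each Y^n_j
  is just the count of x in chunk j minus the count in chunk j+1):

      # LSC computation per STEPS 1-4 above, with the kernels of (0).
      def lsc(text, n):
          L = len(text)
          nblocks = L // n
          total = 0
          for x in sorted(set(text)):                 # STEP 1: one signal per letter
              counts = [text[j*n:(j+1)*n].count(x) for j in range(nblocks)]
              for j in range(nblocks - 1):            # STEP 2: Y^n_j via kernel (0)
                  Y = counts[j] - counts[j+1]
                  total += Y * Y                      # STEPS 3-4: sum the squares over j and x
          return total

  Calling lsc(text, n) for n = 1, 2, 4, 8, ... then gives the whole
  LSC-versus-n plot discussed below.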

LSC as multiscale signal analysis

  The point of using this convoluted formalism is that formula (1) can
  be interpreted as a `scalar product' between the `signal' y() and
  the `kernel' h^n_j(). Except for a constant factor, the coefficients
  Y^n_j is therefore a measure of `how much' of h^n_j() is contained
  in the signal y().
  
  This interpretation of formula (1) is not restricted to the
  particular kernel functions h^n_j of definition (0), but can be used
  with any set of kernels h_r(); in that case, we would write
  
      Y_r = \sum_{i=0}^{L-1}  h_r(i) y(i)                   (1')
  
  This interpretation becomes more interesting when the kernel
  functions h_r() are orthonormal, i.e. \sum_i h_r(i) h_s(i) = 0 for r
  different from s, and \sum_i h_r(i)**2 = 1. Then the signal
  
      y~(i) = \sum_r  Y_r  h_r(i)  

  is the combination of kernel functions that best approximates the
  signal y(). In particular, if every signal y() is a linear
  combination of kernel functions (i.e. the kernels form a basis of
  signal space), the numbers Y_r are the coefficients of that
  combination.
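  
  (The one-line justification: write the approximation error as
  E(c) = \sum_i ( y(i) - \sum_r c_r h_r(i) )**2 and set the derivative
  with respect to c_r to zero; orthonormality kills the cross terms,
  and what is left is exactly c_r = \sum_i h_r(i) y(i) = Y_r.)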
  
  The decomposition of a signal y() into a combination of certain
  kernel functions is a basic tool of signal processing. For instance,
  in Fourier theory the kernels h_r are sinusoids and co-sinusoids on
  the parameter i with various frequencies, and the numbers Y_r are
  the corresponding Fourier coefficients.
  
  The Fourier kernels have many nice properties, but they extend over
  the whole i range; thus changing a single sample y(i) will change
  all the coefficients Y_r. For that reason, in some applications
  people prefer to use kernels that have a bounded support --- i.e.
  they are zero, or practically zero, outside of a small range of the
  i parameter. A set of such kernels is called a {\em spline basis}.

  However, if Fourier kernels suffer from being too `global', spline
  kernels suffer from being too `local': in order to approximate a
  broad hump in y() one needs to add up many narrow bumps. {\em Wavelets}
  were invented to fill the gap between these two extremes.
  
  Note that all the functions h^n_j(i) defined by formula (0) are just
  like the function h^1_0(), only stretched by a factor of n and shifted
  by n*j. This scale-and-shift property is basically the definition of a
  `wavelet family'. The parameter n is called the `scale' of the
  wavelet, and the index j could be called its `phase'.

  Wavelet theory is based on the observation that, if one picks a `mother
  wavelet' h^1_0() with the right shape, then the following wavelets are
  mutually orthogonal and span the space of all signals:

      h^1_0, h^1_1, h^1_2, h^1_3, h^1_4, h^1_5, ...,
      h^2_0, h^2_1, h^2_2, h^2_3, h^2_4, h^2_5, ...,
      h^4_0, h^4_1, h^4_2, h^4_3, h^4_4, h^4_5, ...,     (3)
      h^8_0, h^8_1, h^8_2, h^8_3, h^8_4, h^8_5, ...,

  Note that the scale n goes up in geometric progression, while the
  indices j are consecutive. The wavelets with intermediate scales --
  such as h^3_0 -- are redundant, in the sense that the corresponding
  coefficients Y^n_j can be computed from the coefficients of the
  wavelets listed above. Thus, for a signal of length L, we need only
  L/(w*n) wavelets of scale n, where w is the actual shift between
  h^1_0 and h^1_1. Therefore, the number of terms in each `layer' goes
  down geometrically, and there are only log(L) layers.
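  
  (Indeed, adding up the layer sizes gives

      L/w * (1 + 1/2 + 1/4 + ...)  =  2L/w

  wavelets in all, i.e. on the order of the number of samples, even
  though there are log(L) layers.)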
  
  The decomposition of a signal into a wavelet basis (3) is still
  fairly `local', in the sense that a change in one sample y(i)
  affects only O(log(L)) coefficients Y^n_j, corresponding to wavelets
  h^n_j() whose support includes the point i. On the other hand, a
  broad bump in the signal y() usually gets represented by a couple of
  large coefficients Y^n_j, corresponding to wavelets of roughly the
  same width and position as the bump, plus smaller `adjustment'
  coefficients with smaller scales.

  If we decompose a signal y() into non-redundant wavelets, and then
  take only the components with a fixed scale n, we are basically
  looking at the part of the signal that consists of details of size
  ~w*n. This is almost the same as decomposing y() into sinusoids of
  frequencies 1,2,3,... (cycles over the whole text), and then adding
  only the components with periods between n and 2n (that is,
  frequencies between L/(2n) and L/n).

  So a byproduct of wavelet analysis is a decomposition of the signal
  y() into log(L) parts or `bands,' one for each layer of the basis.
  If we add these components from the bottom up, after each stage we
  obtain a picture of y() that is twice as sharp as the previous one. In
  that case, the quantity S^n of formula (2) is the `strength' (or
  `power' in the usual jargon) of band n of the signal y().
  
  The kernels of definition (0), which are implied by the original LSC
  definition, are NOT orthogonal, but only because they are packed
  too tightly.  If we redefine them as

                 { +1 for 2jn \leq i < (2j+1)n,
      h^n_j(i) = { -1 for (2j+1)n \leq i < (2j+2)n,          (0')
                 { 0 for all other values of i.
  
  (with successive wavelets spaced by 2n rather than n), then all
  members of family (3) become pairwise orthogonal, and (except for
  scale factors) we get the so-called `Haar wavelet basis'.
  
  The bottom-most layer of the Haar basis has only one half of a
  wavelet, whose full width is 2*L; the corresponding coefficient is
  proportional to the average of all samples y(i). The next layer has
  a single wavelet with period L; its coefficient gives the difference
  between the average value of y() in the half-range [0..L/2-1] and
  the average value in [L/2..L-1]. The next layer has two wavelets;
  their coefficients compare the average value of y() between the
  first two quarters of the range [0..L-1], and between the last two
  quarters. And so on.
  
  When we add all layers of the Haar decomposition with scales >= n,
  we get a staircase approximation to the signal y(), with steps of
  width n. Each wavelet in the next layer will split one of these
  steps in two steps of width n/2, raising one side and lowering the
  other by the same amount.
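  
  Here is that picture in a few lines of Python (again my own sketch,
  unnormalized): at each scale the signal is reduced to block sums of
  width two, and the differences of adjacent blocks are the Haar
  detail coefficients for that layer; their sum of squares is the
  analogue of S^n.

      # Unnormalized Haar analysis of a numeric signal y: band_power[n]
      # is the sum of squared detail coefficients at scale n = 1, 2, 4, ...
      def haar_band_power(y):
          y = list(y)
          band_power = {}
          n = 1
          while len(y) >= 2:
              details = [y[2*j] - y[2*j+1] for j in range(len(y)//2)]
              band_power[n] = sum(d*d for d in details)
              y = [y[2*j] + y[2*j+1] for j in range(len(y)//2)]  # block sums at the next scale
              n *= 2
          return band_power

  (If each scale-n coefficient were divided by sqrt(2n), these band
  powers would add up to the power of y() minus the contribution of
  its overall average; this is the normalization issue raised under
  STEP 3 below.)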
  
Variations on the LSC

  This reinterpretation of the LSC exposes several choices that
  were implicit in the original definition.  We can get many
  other `LSC-like' indicators by picking different alternatives for
  those choices.  
  
  In STEP 1 the given text gets converted into one or more numeric
    `signals' y(i). This step is perhaps the most arbitrary part of the
    definition. Almost any mapping here will do, as long as it is
    local --- i.e. changing one letter of the text will affect only
    one signal sample y(i) (or at most a few adjacent samples).
    
    In the original definition, each letter is mapped to 1 or 0
    depending on whether it is equal to or different from a given
    letter x. This simple mapping has the advantage that it makes the
    LSC invariant under simple letter substitutions. It may also be
    the `most sensitive' mapping under the appropriate definition of
    sensitivity. But we could consider other mappings, especially
    if we already know something about the alphabet.  For instance,
    we could map consonants to +1, vowels to -1; or map 
    each letter to a weight that depends on its frequency over the
    whole text.
    
    Note that it is fairly immaterial how we treat spaces,
    punctuation, case, diacritics, digraphs, etc. We can treat those
    features as letters, delete them, lump them with adjacent letters,
    whatever. These choices will normally preserve the long-range
    correlations (although they may affect the exact value of those
    correlations).
    
  In STEP 2 we analyze the signal y() against some repertoire
    of kernel functions h^n_j, and obtain the corresponding
    coefficients Y^n_j. 
    
    One conclusion we get from the `multiscale analysis'
    interpretation is that it is pointless to compute the LSC for all
    scales n between 1 and L. It is sufficient to consider only those
    scales n which are powers of two. This observation allows us to
    cut down the cost of computing the complete LSC plot from O(L**2)
    to O(L log L) (or even O(L), if each scale is computed from the
    block sums of the previous one) --- a 10^5 speedup for typical
    texts!
    
    The multiscale analysis interpretation also suggests that we could
    use other kernel functions in lieu of formula (0). From that
    viewpoint, we can see that the shape of the kernels is not
    important, as long as the application of formula (1) has the
    effect of decomposing the signal y() into frequency bands related
    to the scale n. It is not even necessary for the kernels to have
    compact support. We could use for instance the Fourier kernel
    (sinusoids), numbered so that the kernels with a given scale n
    have frequencies between L/n and L/2n.
    
    As observed above, the kernels implied by the original LSC
    definition (formula (0)) are redundant when all scales are taken
    together, and are not orthogonal. However, they are not redundant
    if we consider a single scale n. In fact, the kernels { h^n_j,
    j=0..L/n-2 } are linearly independent and, viewed as functions of
    the chunk index, just one degree of freedom short of a complete
    basis (they miss only the overall mean); so they do provide good
    coverage.
    
    Another problem with the original kernels, noted above, is that
    they are not orthogonal. This would be an inconvenience for other
    applications, but it does not seem to be a problem if the only
    goal is to distinguish meaningful signals from random noise.
    
    A more serious defect is that the kernels are square-shaped with
    sharp discontinuities. Because of such discontinuities, each
    `layer' of the implied signal decomposition will cover a rather
    broad band of frequencies, and there will be significant overlap
    between bands. Unfortunately, I don't know how to explain this
    point in a few words.
    
    Anyway, the conclusion is that the use of discontinuous kernels may
    result in `blurred' LSC-versus-n plots, with spurious peaks and
    valleys known as `Gibbs effects' or `ringing.' It may be worth
    trying smoother kernels, such as gaussians and their derivatives,
    or gaussian-sinusoid products.
    
    In fact, you should consider using the smoothest possible kernels,
    namely the Fourier basis.  In other words, instead of the 
    LSC as defined, you could compute the power spectra of the 
    signals y(). 
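    
    In case it is useful, that would look something like this in
    Python with numpy (my own sketch; the grouping of the spectrum
    into octave bands is just one natural choice):

        # Power spectrum of the binary signal for one letter, grouped
        # into octave bands.  Frequencies are in cycles per text length.
        import numpy as np

        def octave_band_powers(text, letter):
            y = np.array([1.0 if c == letter else 0.0 for c in text])
            y -= y.mean()                          # drop the DC component
            spectrum = np.abs(np.fft.rfft(y))**2
            bands = {}
            f = 1
            while f < len(spectrum):
                bands[f] = spectrum[f:min(2*f, len(spectrum))].sum()
                f *= 2
            return bands                           # band f covers frequencies f .. 2f-1

    A flat spectrum then gives band powers growing like f, while a 1/f
    spectrum gives roughly equal band powers, as discussed at the
    beginning of this note.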
    
  STEP 3 computes the sum S^n of the squares of the coefficients
    Y^n_j, for all j. I can't think of any useful alternative to this
    step. However, under the `signal analysis' interpretation, it is
    desirable to choose the kernel functions so that they are
    orthogonal and have unit norm (i.e. \sum_i h^n_j(i)**2 = 1). These
    conditions ensure that the power of the original signal y() is
    equal to the sum of all S^n.  That will make the value of S^n
    more meaningful, and easier to estimate for random texts.
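    
    (This is just Parseval's identity: if the kernels are orthonormal
    and form a complete basis, then

        \sum_i y(i)**2  =  \sum_{n,j} (Y^n_j)**2  =  \sum_n S^n.)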

Other comments

  In conclusion, my understanding of the Perakh-McKay papers is that
  computing the LSC is an indirect way of computing the power spectrum
  of the text. The reason why the LSC distinguishes meaningful texts
  from monkey gibberish is that the former have variations in letter
  frequencies at all scales, and hence a 1/f-like power spectrum;
  whereas the latter have uniform letter frequencies, at least over
  scales of a dozen letters, and therefore have a flat power spectrum.
  
  Looking at the LSC in the context of multiscale analysis suggests
  many possible improvements, such as using scales in geometric
  progression, and kernels which are smoother, orthogonal, and
  of unit norm.  Even if these changes do not make the LSC more sensitive,
  they should make the results easier to evaluate.
  
  In retrospect, it is not surprising that the LSC can distinguish the
  original Genesis from a line-permuted version: the spectra should be
  fairly similar at high frequencies (with periods shorter than one
  line), but at low frequencies the second text should have an
  essentially flat spectrum, like that of a random signal. The same
  can be said about monkey-generated texts.
  
  On the other hand, I don't expect the LSC to be more effective than
  simple letter/digraph frequency analysis when it comes to
  identifying the language of a text. The most significant influence
  on the LSC is the letter frequency histogram --- which is sensitive
  to topic (e.g. "-ed" is common when talking about past) and to
  spelling rules (e.g. whether one writes "ue" or "ü"). The shape of
  the LSC (or Fourier) spectrum at high frequencies (small n) must be
  determined mainly by these factors. The shape of the spectrum at
  lower frequencies (higher n) should be determined chiefly by topic
  and style.

  I have only skimmed through the sections about estimating the LSC
  for random texts. I didn't see anything wrong, but my impression is
  that those sections are overdoing it. For instance, it should not
  matter whether the letters are drawn with or without replacement,
  or whether the last chunk is complete or not: surely the effect of
  these details on the LSC must be less than the sampling error.
  (Compare, for instance, the LSC of the first half of Genesis with
  that of the second half.) Thus, I believe that these sections could
  be much shorter if these details were dismissed right at the
  beginning.
  
All the best,

--stolfi