
To: perakh@xxxxxxxxxxx
Subject: Reinterpreting the LSC (long)
From: Jorge Stolfi <stolfi@xxxxxxxxxxxxxx>
Date: Fri, 21 Jan 2000 23:52:07 -0200 (EDT)
Cc: voynich@xxxxxxxx
Delivered-to: reeds@research.att.com
In-reply-to: <3886379D.4D1B8525@nctimes.net>
References: <12AJxt-0OiTYmC@fwd03.sul.t-online.de> <3886379D.4D1B8525@nctimes.net>
Reply-to: stolfi@xxxxxxxxxxxxxx
Sender: jim@xxxxxxxxxxxxx

Mark, I haven't had the time to read your papers carefully enough, but the idea of the LSC seems quite interesting. Here are some of my comments.

Why should the LSC work?

In a very broad sense, the LSC and the nth-order character/word entropies are trying to measure the same thing, namely the correlation between letters that are a fixed distance apart. People have observed before that correlation between samples n steps apart tends to be higher for "meaningful" signals than for "random" ones, even for large n. The phenomenon has been observed in music, images, DNA sequences, etc. This knowledge has been useful for, among other things, designing good compression and approximation methods for such signals. Some of the buzzwords one meets in that context are "fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I believe that Gabriel has written papers on fractals in the context of medical imaging. And a student of mine just finished her thesis on reassembling pottery fragments by matching their outlines, which turn out to be "fractal" too.)

As I try to show below, one can understand the LSC as decomposing the text into various frequency bands, and measuring the `power' contained in each band. If we do that to a random signal, we will find that each component frequency has roughly constant expected power; i.e. the power spectrum is flat, like that of ideal white light (hence the nickname `white noise'). On the other hand, a `meaningful' signal (like music or speech) will be `lumpier' than a random one, at all scales; so its power spectrum will show an excess of power at the lower frequencies. It is claimed that, in such signals, the power tends to be inversely proportional to the frequency; hence the moniker `1/f noise'.
If we lump the spectrum components into frequency bands, we will find that the total power contained in the band of frequencies between f and 2f will be proportional to f for a random signal, but roughly constant for a `meaningful' signal whose spectrum indeed follows the 1/f profile.

Is the LSC better than nth-order entropy?

In theory, the nth-order entropies are more powerful indicators of structure. Roughly speaking, *any* regular structure in the text will show up in some nth-order entropy; whereas I suspect that one can construct signals that have strong structure (hence low entropy) but the same LSC as a purely random text.

However, the formula for nth-order entropy requires one to estimate z**n probabilities, where z is the size of the alphabet. To do that reliably, one needs a corpus whose length is many times z**n. So the entropies are not very meaningful for n beyond 3 or so. The nth-order LSC seems to be numerically more stable, because it maps blocks of n consecutive letters into a single `super-letter' which is actually a vector of z integers, and compares these super-letters as vectors (with a difference-squared metric) rather than as symbols (with a simple 0-1 metric). I haven't done the math --- perhaps you have --- but it seems that computing the nth-order LSC to a fixed accuracy requires a corpus whose length L is proportional to z*n (or perhaps z*n**2?) instead of z**n.

Moreover, one kind of structure that the LSC *can* detect is any medium- or long-range variation in word usage frequency along the text. (In fact, the LSC seems to have been designed specifically for that purpose.) As observed above, such variations are present in most natural languages, but absent in random texts, even those generated by kth-order monkeys.
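To make the z**n sample-size problem concrete, here is a small sketch of my own (not from the papers) that estimates the nth-order block entropy from empirical n-gram counts. With an alphabet of z letters there are z**n possible blocks, so the corpus must be many times that long before the counts are trustworthy; the function itself is the straightforward plug-in estimate.

```python
import math
from collections import Counter

def block_entropy(text, n):
    """Entropy (bits per block) of the empirical distribution of the n-grams
    of `text`.  With z**n possible blocks, the estimate is only reliable when
    the corpus is many times z**n letters long -- hence useless for large n."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())

# A toy corpus over a 4-letter alphabet: z**n distinct blocks are possible
# at order n, so the counts thin out rapidly as n grows.
text = "abcdabdcacbdabcd" * 50
for n in (1, 2, 3):
    print(n, round(block_entropy(text, n), 3))
```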
Specifically, if we take the output of a kth-order `letter monkey' and break it into chunks whose length n >> k, we will find that the number of times a given letter occurs in each chunk is fairly constant (except for sampling error) among all chunks. For kth-order `word monkeys' we should have the same result as long as n >> k*w, where w is the average word length. On the other hand, a natural-language text will show variations in letter frequencies, due to changes of topic and hence of vocabulary, that extend over whole paragraphs or chapters. Thus, although the LSC may not be powerful enough to detect the underlying structure in non-trivial ciphers, it seems well suited to distinguishing natural language from monkey-style random text.

Another view of LSC

Someone -- I think it was Goethe -- observed that mathematicians are like Frenchmen: whatever you say to them, they immediately translate into their language, and it then becomes something else entirely. Although I am not a mathematician, I cannot resist the chance to translate your LSC definition into my own brand of `French.' Here it goes:

STEP 1 of the LSC computation consists in replacing each letter Y by a vector of z zeros and ones, where the r-th component of the vector is 1 if and only if Y is the r-th letter of the alphabet. Thus the original text is replaced by z binary strings of the same length. Steps 2 and 3 below are applied to each of these z binary strings, separately.

STEP 2 takes a binary string y(i), i = 0..L-1, and produces a string of integers Y^n_j, j = 0..L/n-2, by the formula

    Y^n_j = \sum_{i=0}^{L-1} h^n_j(i) y(i),                      (1)

where h^n_j() is the `kernel function'

               { +1  for jn \leq i < (j+1)n,
    h^n_j(i) = { -1  for (j+1)n \leq i < (j+2)n,                 (0)
               {  0  for all other values of i.

In words, h^n_j() is a train of n `+1's followed by n `-1's, all surrounded by a sea of zeros. (Here "^n" is merely a superscript index, not a power.)
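The monkey-versus-language contrast above can be illustrated numerically. The sketch below is my own synthetic example: a 0th-order letter monkey drawn with fixed probabilities, against a crude stand-in for topic drift in which the letter weights change halfway through the text, as vocabulary would across chapters. The per-chunk counts of a letter are nearly constant for the monkey but vary widely for the drifting text.

```python
import random

random.seed(1)
ALPHABET = "abcd"
L, CHUNK = 20000, 500

# 0th-order "monkey": letters drawn i.i.d. with fixed probabilities.
monkey = random.choices(ALPHABET, weights=[4, 3, 2, 1], k=L)

# Crude stand-in for topic drift: the letter weights change halfway through.
drift = (random.choices(ALPHABET, weights=[4, 3, 2, 1], k=L // 2)
         + random.choices(ALPHABET, weights=[1, 2, 3, 4], k=L // 2))

def chunk_count_variance(text, letter, n):
    """Variance of the per-chunk counts of `letter` over chunks of length n."""
    counts = [text[i:i + n].count(letter) for i in range(0, len(text), n)]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

v_monkey = chunk_count_variance(monkey, "a", CHUNK)
v_drift = chunk_count_variance(drift, "a", CHUNK)
print(v_monkey, v_drift)  # the drifting text shows far larger variance
```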
STEP 3 takes the numbers Y^n_j and computes the sum

    S^n = \sum_{j=0}^{L/n-2} Y^n_j**2                            (2)

STEP 4 adds together the sums S^n_x obtained for the different letters x into a single sum S^n_tot.

LSC as multiscale signal analysis

The point of using this convoluted formalism is that formula (1) can be interpreted as a `scalar product' between the `signal' y() and the `kernel' h^n_j(). Except for a constant factor, the coefficient Y^n_j is therefore a measure of `how much' of h^n_j() is contained in the signal y().

This interpretation of formula (1) is not restricted to the particular kernel functions h^n_j of definition (0), but can be used with any set of kernels h_r(); in that case, we would write

    Y_r = \sum_{i=0}^{L-1} h_r(i) y(i)                           (1')

This interpretation becomes more interesting when the kernel functions h_r() are orthonormal, i.e. \sum_i h_r(i) h_s(i) = 0 for r different from s, and \sum_i h_r(i)**2 = 1. Then the signal y~(i) = \sum_r Y_r h_r(i) is the combination of kernel functions that best approximates the signal y(). In particular, if every signal y() is a linear combination of kernel functions (i.e. the kernels form a basis of signal space), the numbers Y_r are the coefficients of that combination.

The decomposition of a signal y() into a combination of certain kernel functions is a basic tool of signal processing. For instance, in Fourier theory the kernels h_r are sinusoids and co-sinusoids in the parameter i with various frequencies, and the numbers Y_r are the corresponding Fourier coefficients. The Fourier kernels have many nice properties, but they extend over the whole i range; thus changing a single sample y(i) will change all the coefficients Y_r. For that reason, in some applications people prefer to use kernels that have bounded support --- i.e. they are zero, or practically zero, outside of a small range of the i parameter. A set of such kernels is called a {\em spline basis}.
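STEPs 1 through 4 can be transcribed almost literally into code. The following is my own sketch of the definition above (assuming, for simplicity, that L is a multiple of n, so there is no incomplete final chunk):

```python
def lsc(text, n):
    """S^n_tot: over all letters x, the sum of (Y^n_j)^2, where Y^n_j is the
    count of x in chunk j minus the count in chunk j+1 (chunks of length n).
    Follows STEPs 1-4 literally; assumes len(text) is a multiple of n."""
    L = len(text)
    total = 0
    for x in sorted(set(text)):
        # STEP 1: indicator signal y(i) = 1 iff text[i] == x.
        y = [1 if c == x else 0 for c in text]
        # STEP 2: Y^n_j is the scalar product of y with the kernel h^n_j of
        # formula (0), i.e. the difference between consecutive chunk counts.
        counts = [sum(y[i:i + n]) for i in range(0, L, n)]
        Y = [counts[j] - counts[j + 1] for j in range(len(counts) - 1)]
        # STEPs 3 and 4: accumulate the squared coefficients into S^n_tot.
        total += sum(v * v for v in Y)
    return total

# A perfectly periodic text has identical chunks, hence S^n_tot = 0:
print(lsc("abababababab", 2))  # prints 0
```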
However, if Fourier kernels suffer from being too `global', spline kernels suffer from being too `local': in order to approximate a broad hump of y() one needs to add many narrow bumps. {\em Wavelets} were invented to fill the gap between these two extremes.

Note that all the functions h^n_j(i) defined by formula (0) are just like the function h^1_0(), only stretched by a factor of n and shifted by n*j. This scale-and-shift property is basically the definition of a `wavelet family'. The parameter n is called the `scale' of the wavelet, and the index j could be called its `phase'. Wavelet theory is based on the observation that, if one picks a `mother wavelet' h^1_0() with the right shape, then the following wavelets are mutually orthogonal and span the space of all signals:

    h^1_0, h^1_1, h^1_2, h^1_3, h^1_4, h^1_5, ...,
    h^2_0, h^2_1, h^2_2, h^2_3, h^2_4, h^2_5, ...,
    h^4_0, h^4_1, h^4_2, h^4_3, h^4_4, h^4_5, ...,               (3)
    h^8_0, h^8_1, h^8_2, h^8_3, h^8_4, h^8_5, ...,
    ...

Note that the scale n goes up in geometric progression, while the indices j are consecutive. The wavelets with intermediate scales -- such as h^3_0 -- are redundant, in the sense that the corresponding coefficients Y^n_j can be computed from the coefficients of the wavelets listed above. Thus, for a signal of length L, we need only L/(w*n) wavelets of scale n, where w is the actual shift between h^1_0 and h^1_1. Therefore the number of terms in each `layer' goes down geometrically, and there are only log(L) layers.

The decomposition of a signal into a wavelet basis (3) is still fairly `local', in the sense that a change in one sample y(i) affects only O(log(L)) coefficients Y^n_j, corresponding to the wavelets h^n_j() whose support includes the point i. On the other hand, a broad bump in the signal y() usually gets represented by a couple of large coefficients Y^n_j, corresponding to wavelets of roughly the same width and position as the bump, plus smaller `adjustment' coefficients at smaller scales.
If we decompose a signal y() into non-redundant wavelets, and then take only the components with a fixed scale n, we are basically looking at the part of the signal that consists of details of size ~w*n. This is almost the same as decomposing y() into sinusoids of frequencies 1,2,3,..., and then adding only the components with frequencies between n and 2n. So a byproduct of wavelet analysis is a decomposition of the signal y() into log(L) parts or `bands', one for each layer of the basis. If we add these components from the bottom up, after each stage we obtain a picture of y() that is twice as sharp as the previous one. In that case, the quantity S^n of formula (2) is the `strength' (or `power', in the usual jargon) of band n of the signal y().

The kernels of definition (0), which are implied by the original LSC definition, are NOT orthogonal, but only because they are packed too tightly. If we redefine them as

               { +1  for 2jn \leq i < (2j+1)n,
    h^n_j(i) = { -1  for (2j+1)n \leq i < (2j+2)n,               (0')
               {  0  for all other values of i.

(with successive wavelets spaced by 2n rather than n), then all members of family (3) become pairwise orthogonal, and (except for scale factors) we get the so-called `Haar wavelet basis'.

The bottom-most layer of the Haar basis has only one half of a wavelet, whose full width is 2*L; the corresponding coefficient is proportional to the average of all samples y(i). The next layer has a single wavelet with period L; its coefficient gives the difference between the average value of y() in the half-range [0..L/2-1] and the average value in [L/2..L-1]. The next layer has two wavelets; their coefficients compare the average values of y() between the first two quarters of the range [0..L-1], and between the last two quarters. And so on. When we add all layers of the Haar decomposition with scales >= n, we get a staircase approximation to the signal y(), with steps of width n.
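The packed-versus-spaced distinction is easy to verify numerically. The sketch below (my own check, with hypothetical helper names) builds the kernels of formula (0), where consecutive wavelets overlap by n samples, and Haar-style kernels of the same n-wide shape whose supports within a scale are disjoint because successive wavelets are spaced 2n apart; the dot products confirm that only the latter are orthogonal.

```python
def kernel_packed(n, j, L):
    """h^n_j of formula (0): +1 on [jn, (j+1)n), -1 on [(j+1)n, (j+2)n)."""
    return [1 if j*n <= i < (j+1)*n else -1 if (j+1)*n <= i < (j+2)*n else 0
            for i in range(L)]

def kernel_haar(n, j, L):
    """Same shape, but successive wavelets spaced 2n apart, so the supports
    within one scale are disjoint (Haar-style)."""
    return [1 if 2*j*n <= i < (2*j+1)*n else
            -1 if (2*j+1)*n <= i < (2*j+2)*n else 0
            for i in range(L)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

L, n = 32, 4
overlap_packed = dot(kernel_packed(n, 0, L), kernel_packed(n, 1, L))
overlap_haar = dot(kernel_haar(n, 0, L), kernel_haar(n, 1, L))
# Packed neighbours share n samples with opposite signs, giving -n;
# the spaced Haar neighbours do not overlap at all.
print(overlap_packed, overlap_haar)  # prints -4 0
```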
Each wavelet in the next layer will split one of these steps into two steps of width n/2, raising one side and lowering the other by the same amount.

Variations on the LSC

This reinterpretation of the LSC exposes several choices that were implicit in the original definition. We can get many other `LSC-like' indicators by picking different alternatives for those choices.

In STEP 1 the given text gets converted into one or more numeric `signals' y(i). This step is perhaps the most arbitrary part of the definition. Almost any mapping will do here, as long as it is local --- i.e. changing one letter of the text will affect only one signal sample y(i) (or at most a few adjacent samples). In the original definition, each letter is mapped to 1 or 0 depending on whether it is equal to or different from a given letter x. This simple mapping has the advantage of making the LSC invariant under simple letter substitutions. It may also be the `most sensitive' mapping, under an appropriate definition of sensitivity. But we could consider other mappings, especially if we already know something about the alphabet. For instance, we could map consonants to +1 and vowels to -1; or map each letter to a weight that depends on its frequency over the whole text.

Note that it is fairly immaterial how we treat spaces, punctuation, case, diacritics, digraphs, etc. We can treat those features as letters, delete them, lump them with adjacent letters, whatever. These choices will normally preserve the long-range correlations (although they may affect the exact value of those correlations).

In STEP 2 we analyze the signal y() against some repertoire of kernel functions h^n_j, and obtain the corresponding coefficients Y^n_j. One conclusion we get from the `multiscale analysis' interpretation is that it is pointless to compute the LSC for all scales n between 1 and L; it is sufficient to consider only those scales n which are powers of two.
This observation allows us to cut down the cost of computing the complete LSC plot from O(L**2) to O(L) --- a 10^5 speedup for typical texts!

The multiscale analysis interpretation also suggests that we could use other kernel functions in lieu of formula (0). From that viewpoint, we can see that the shape of the kernels is not important, as long as the application of formula (1) has the effect of decomposing the signal y() into frequency bands related to the scale n. It is not even necessary for the kernels to have compact support. We could use, for instance, the Fourier kernels (sinusoids), numbered so that the kernels with a given scale n have frequencies between L/n and L/2n.

As observed above, the kernels implied by the original LSC definition (formula (0)) are redundant when all scales are taken together. However, they are not redundant if we consider a single scale n. In fact, the kernels { h^n_j, j=0..L/n-2 } are linearly independent, and just one degree of freedom short of being a complete basis; so they do provide good coverage.

Another problem with the original kernels is that they are not orthogonal. This would be an inconvenience in other applications, but it does not seem to be a problem if the only goal is to distinguish meaningful signals from random noise.

A more serious defect is that the kernels are square-shaped, with sharp discontinuities. Because of those discontinuities, each `layer' of the implied signal decomposition will cover a rather broad band of frequencies, and there will be significant overlap between bands. Unfortunately, I don't know how to explain this point in few words. Anyway, the conclusion is that the use of discontinuous kernels may result in `blurred' LSC-versus-n plots, with spurious peaks and valleys known as `Gibbs effects' or `ringing'. It may be worth trying smoother kernels, such as gaussians and their derivatives, or gaussian-sinusoid products.
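The power-of-two speedup comes from reusing work across scales: the chunk counts at scale 2n are obtained by pairing up the chunk counts at scale n, so the layers cost L/1 + L/2 + L/4 + ... = O(L) in total. A sketch of mine for one letter (assuming the text length is a power of two):

```python
def lsc_all_scales(text, letter):
    """S^n for every power-of-two scale n, for one letter, in O(L) total
    work: the chunk counts at scale 2n are the pairwise sums of those at
    scale n.  Assumes len(text) is a power of two."""
    counts = [1 if c == letter else 0 for c in text]  # chunk counts, n = 1
    result = {}
    n = 1
    while len(counts) >= 2:
        # Y^n_j = counts[j] - counts[j+1], per the kernels of formula (0).
        result[n] = sum((counts[j] - counts[j + 1]) ** 2
                        for j in range(len(counts) - 1))
        # Merge neighbouring chunks: the scale doubles, the list halves.
        counts = [counts[j] + counts[j + 1]
                  for j in range(0, len(counts) - 1, 2)]
        n *= 2
    return result

print(lsc_all_scales("aabb" * 4, "a"))
```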
In fact, you should consider using the smoothest possible kernels, namely the Fourier basis. In other words, instead of the LSC as defined, you could compute the power spectra of the signals y().

STEP 3 computes the sum S^n of the squares of the coefficients Y^n_j, for all j. I can't think of any useful alternative to this step. However, under the `signal analysis' interpretation, it is desirable to choose the kernel functions so that they are orthogonal and have unit norm (i.e. \sum_i h^n_j(i)**2 = 1). These conditions ensure that the power of the original signal y() is equal to the sum of all the S^n. That will make the value of S^n more meaningful, and easier to estimate for random texts.

Other comments

In conclusion, my understanding of the Perakh-McKay papers is that computing the LSC is an indirect way of computing the power spectrum of the text. The reason why the LSC distinguishes meaningful texts from monkey gibberish is that the former have variations in letter frequencies at all scales, and hence a 1/f-like power spectrum; whereas the latter have uniform letter frequencies, at least over scales of a dozen letters, and therefore have a flat power spectrum.

Looking at the LSC in the context of multiscale analysis suggests many possible improvements, such as using scales in geometric progression, and kernels which are smoother, orthogonal, and unitary. Even if these changes do not make the LSC more sensitive, they should make the results easier to evaluate.

In retrospect, it is not surprising that the LSC can distinguish the original Genesis from a line-permuted version: the spectra should be fairly similar at high frequencies (with periods shorter than one line), but at low frequencies the second text should have an essentially flat spectrum, like that of a random signal. The same can be said about monkey-generated texts.
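The power-spectrum route can be sketched directly. This is my own illustration using a naive O(L**2) discrete Fourier transform (for real texts one would use an FFT); it also checks the unit-norm point above via Parseval's identity, sum_k |Y_k|^2 / L = sum_i y(i)^2.

```python
import cmath

def power_spectrum(y):
    """|DFT|^2 of a real signal, computed naively in O(L^2).
    By Parseval, sum(spectrum) / L equals the signal power sum of y(i)^2."""
    L = len(y)
    return [abs(sum(y[i] * cmath.exp(-2j * cmath.pi * k * i / L)
                    for i in range(L))) ** 2
            for k in range(L)]

# The indicator signal of one letter, as in STEP 1 of the LSC definition.
text = "abaabbbaabababbb"
y = [1.0 if c == "a" else 0.0 for c in text]
spec = power_spectrum(y)

parseval_lhs = sum(spec) / len(y)
parseval_rhs = sum(v * v for v in y)
print(round(parseval_lhs, 6), parseval_rhs)
```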
On the other hand, I don't expect the LSC to be more effective than simple letter/digraph frequency analysis when it comes to identifying the language of a text. The most significant influence on the LSC is the letter frequency histogram --- which is sensitive to topic (e.g. "-ed" is common when talking about the past) and to spelling rules (e.g. whether one writes "ue" or "ü"). The shape of the LSC (or Fourier) spectrum at high frequencies (small n) must be determined mainly by these factors. The shape of the spectrum at lower frequencies (higher n) should be determined chiefly by topic and style.

I have only skimmed through the sections about estimating the LSC for random texts. I didn't see anything wrong, but my impression is that those sections are overdoing it. For instance, it should not matter whether the letters are drawn with or without replacement, or whether the last chunk is complete or not: surely the effect of these details on the LSC must be less than the sampling error. (Compare, for instance, the LSC of the first half of Genesis with that of the second half.) Thus, I believe that these sections could be much shorter if these details were dismissed right at the beginning.

All the best,

--stolfi

Follow-Ups:
  Re: Reinterpreting the LSC (long) (From: Gabriel Landini)

References:
  A few LSC comments (From: Rene)
  Re: A few LSC comments (From: Mark Perakh)
