# Re: Reinterpreting the LSC (long)

On 21 Jan 00, at 23:52, Jorge Stolfi wrote:
> The phenomenon has been observed in music, images, DNA
> sequences, etc. This knowledge has been useful for, among other
> things, designing good compression and approximation methods for
> such signals.

In the DNA arena there has been a never-ending debate between two groups. Since DNA is also a symbolic sequence that carries a message, I thought some of you might be interested in what the debate is about and whether one can do a similar thing with the VMS.

One group has done exactly what Jorge proposed:

> STEP 1 of the LSC computation consists in replacing each letter Y by
> a vector of z zeros and ones, where the r-th component of the
> vector is 1 if and only if Y is the r-th letter of the alphabet.

I think that this has been called a binary coding of DNA using a Heaviside function.

for base A:

ATCGAAGTACGC....
100011001000....

and so on for the other bases. Each of these binary sequences is then submitted to a Fast Fourier Transform, and the power spectrum is plotted as log(power) vs. log(frequency). A slope around -1 is characteristic of 1/f noise, 0 is white noise, -2 is Brown (or Brownian) noise, and -3 is black noise.
Using this method on the 4 bases separately, or on the average of the 4 spectra, long-range correlations do show up, but only at the very long range. There have been claims that the slope differs between species and that this depends on evolution.
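
For anyone who wants to try this, here is a minimal sketch of the indicator coding and the log-log slope estimate described above. It is only an illustration under my own assumptions, not the code used by that group: the function names (indicator, power_spectrum, loglog_slope) are made up for this example, the mean is subtracted before the transform, and a slow direct DFT stands in for a real FFT so that nothing outside the Python standard library is needed.

```python
import cmath
import math


def indicator(seq, symbol):
    """Return a 0/1 list with 1 wherever `seq` contains `symbol`."""
    return [1 if ch == symbol else 0 for ch in seq]


def power_spectrum(signal):
    """Power |X_k|^2 for k = 1 .. n//2, from a direct O(n^2) DFT.

    The mean is subtracted first (an assumption of this sketch). A real
    analysis would use an FFT library instead of this slow loop.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [v - mean for v in signal]
    powers = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centred))
        powers.append(abs(coeff) ** 2)
    return powers


def loglog_slope(powers):
    """Least-squares slope of log(power) vs. log(frequency index).

    powers[0] is taken to belong to frequency index 1; zero powers are
    skipped because their logarithm is undefined.
    """
    points = [(math.log(k), math.log(p))
              for k, p in enumerate(powers, start=1) if p > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


if __name__ == "__main__":
    dna = "ATCGAAGTACGC"
    bits = indicator(dna, "A")
    print(bits)  # [1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0]
    # Far too short for a meaningful slope; shown only to exercise the calls.
    print(loglog_slope(power_spectrum(bits)))
```

A sequence this short says nothing about long-range correlation, of course. The slope only becomes meaningful on sequences of many thousands of symbols, and in practice one would average the spectra of several windows before fitting the line.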

Another group used a more arbitrary method. As each of DNA's 4 bases is of one of 2 different types (purine or pyrimidine), they construct a 1-dimensional random walk based on whether the next base is of one type or the other (thus going up or down). This walk is then submitted to something called R/S analysis, in which the sequence is divided into chunks, the increments within each chunk are calculated, and log(range / standard deviation of the increments) (hence R/S) is plotted against log(segment size). A slope (the Hurst exponent) of 0.5 is characteristic of Brownian motion (which is the integral of white noise); values larger than that make the sequence "persistent" and values smaller than 0.5 make it "anti-persistent".
The only sensible thing here would be to embed the random walk in the 4-base space, but apparently if you do that, what they try to show does not always show up (!). Note that this "base-type" encoding is quite arbitrary, because bases of the same type are not interchangeable, so coding them by type alone throws information away. This is like saying that one can re-code a language based on whether the letters have "roundy" bits (a, o, d, b, q, p) or not (i, t, y, x, etc.), so draw your own conclusions.
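
The base-type walk and the R/S procedure can be sketched as follows. Again this is only a rough illustration and not that group's actual code: the names (walk_increments, rescaled_range, hurst_exponent) are invented here, the choice of +1 for purines and -1 for pyrimidines is arbitrary, chunks do not overlap, and R/S is simply averaged over the chunks of each size before the line is fitted.

```python
import math
import random

PURINES = frozenset("AG")  # C and T are the pyrimidines


def walk_increments(seq):
    """+1 for a purine, -1 for a pyrimidine (the sign choice is arbitrary)."""
    return [1 if base in PURINES else -1 for base in seq]


def rescaled_range(increments):
    """R/S of one chunk, or None if the chunk has zero variance.

    R is the range of the cumulative deviations from the chunk mean,
    S is the standard deviation of the increments in the chunk.
    """
    n = len(increments)
    mean = sum(increments) / n
    running = lowest = highest = 0.0
    for v in increments:
        running += v - mean
        lowest = min(lowest, running)
        highest = max(highest, running)
    std = math.sqrt(sum((v - mean) ** 2 for v in increments) / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_exponent(increments, sizes):
    """Slope of log(mean R/S) vs. log(chunk size); needs >= 2 usable sizes."""
    points = []
    for size in sizes:
        ratios = []
        # Non-overlapping chunks; a trailing partial chunk is ignored.
        for start in range(0, len(increments) - size + 1, size):
            rs = rescaled_range(increments[start:start + size])
            if rs:  # skip chunks with no variance
                ratios.append(rs)
        if ratios:
            points.append((math.log(size),
                           math.log(sum(ratios) / len(ratios))))
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    fake_dna = "".join(rng.choice("ACGT") for _ in range(4096))
    steps = walk_increments(fake_dna)
    print(hurst_exponent(steps, [16, 32, 64, 128, 256]))
```

On uncorrelated input like the fake sequence above, the estimate should come out in the neighbourhood of 0.5, though short chunks are known to bias R/S estimates upward, so small departures from 0.5 should not be over-interpreted.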

Anyway, that group claims that non-coding areas of DNA (the so-called junk DNA) have long-range correlations, while the coding areas (genes) do not. The finding is interesting, but to date I do not think there are any clues about its meaning or relevance. Why? Because that "random walk" does not fully carry the message of DNA.

Of course this can be applied to the VMS. The only problem is that we know that a number of pages are missing, so our sequence is not continuous. It may be interesting to try anyway.

I've been about to do some of this since I joined the list. Perhaps it is time...

cheers,

Gabriel