
Re: VMs: Number crunching the Fincher window

I sent this yesterday, but I haven't received it or seen it in the archive.
Sorry if you received this twice.


Subject: Re: VMs: Number crunching the Fincher window
Date: Tuesday 14 September 2004 15:43
From: Gabriel Landini <G.Landini@xxxxxxxxxx>
To: vms-list@xxxxxxxxxxx

On Tuesday 14 September 2004 14:31, elvogt@xxxxxxxxxxx wrote:
> If Marke is right (and I understand him correctly), the VM is a hoax.

I still do not fully understand why one should be interested in these sorted
sub-strings (or substring families), but I find it curious that they should
imply a hoax. Why? Mostly because it seems to be unknown how this sub-string
distribution behaves in other languages and in word-scrambled versions of
any text.
It may be more common than one thinks, or it may be meaningless.

I suspect that this effect may also be related to a measure of string
complexity in terms of Lempel-Ziv entropy.
Consider that the counting commas algorithm does something (remotely) similar
but in the opposite direction: it parses the stream into segments, each
containing a new string, so that one ends up with a dictionary of substrings.
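
To make the parsing idea concrete, here is a minimal sketch. It assumes that
an LZ78-style incremental parse is close enough to what "counting commas"
means here; the function name lz78_phrases is only illustrative, and the
original algorithm may differ in detail.

```python
def lz78_phrases(stream):
    """Split stream into phrases, each the shortest prefix not seen before.

    Assumption: an LZ78-style parse stands in for the 'counting commas' idea.
    The number of phrases is a rough indicator of string complexity.
    """
    seen = set()      # dictionary of phrases found so far
    phrases = []      # phrases in order of appearance (the 'commas' fall between them)
    current = ""
    for ch in stream:
        current += ch
        if current not in seen:
            # A new string: record it and start the next segment.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # The stream ended inside a phrase that was already in the dictionary.
        phrases.append(current)
    return phrases
```

A repetitive stream yields fewer and longer phrases than an irregular one of
the same length, so len(lz78_phrases(text)) can be compared across texts.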

So for a moment think about the string search in these terms:
For a string s (window size n), estimate the probability of s in the corpus
by counting the hits when one slides the window through the entire stream.
Repeat for all existing strings of size n.
Then repeat for strings with n-1 characters and so on until n=1.
Now from the distribution of the strings s at each size n, one can calculate
the entropy of the n-plets. Then plot this entropy as a function of n.
That graph could be a descriptor of the repeatability of the substrings, and
it takes into consideration all strings at all sizes. One could do this for
several texts and languages.
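
The procedure above can be sketched roughly as follows. This is only an
illustration: the names ngram_entropy and entropy_profile are made up, the
entropy is plain Shannon entropy in bits, and the stream is treated as one
long string (spaces included, unless one strips them first).

```python
from collections import Counter
from math import log2


def ngram_entropy(stream, n):
    """Shannon entropy (bits) of the n-plets found by sliding a window of size n."""
    if n < 1 or len(stream) < n:
        return 0.0
    # Count the hits for every string of size n in the stream.
    counts = Counter(stream[i:i + n] for i in range(len(stream) - n + 1))
    total = sum(counts.values())
    # Relative frequency is used as the probability estimate of each string.
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_profile(stream, max_n):
    """Entropy for window sizes max_n down to 1, as a dict {n: entropy}."""
    return {n: ngram_entropy(stream, n) for n in range(max_n, 0, -1)}
```

Plotting the values of entropy_profile(text, max_n) against n for several
texts would give the graph described above. For large n the counts get
sparse, so the estimates become unreliable unless the corpus is long.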

Now (if I got it right) the 'Fincher' strings are the subset of the entire
collection of strings, starting from length n=1, that can be mapped into
super-strings of size n+1 (=2), plus those that can also be mapped into n=3,
and so on until one reaches some arbitrarily high n.
Considering that VMs word structure is quite rigid (and has low entropy),
I would expect that this mapping up (or down, if one starts from the
longest strings) would be more common in the VMs than in sequences with
higher entropy.
I therefore suspect that looking at the distribution of 'Fincher' strings is
an indirect measure of string complexity and entropy.
If that is the case, then I am not sure that they'll tell us anything new.
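
For what it is worth, one possible reading of this "mapping up" can be
sketched as below. This is an assumption on my part and not necessarily the
definition Marke uses: it takes a vocabulary of word types and, for each word,
finds the longest chain in which every step is a word exactly one character
longer that contains the previous word as a substring. The name chain_lengths
is hypothetical.

```python
from functools import lru_cache


def chain_lengths(vocabulary):
    """For each word, length of the longest chain w1 < w2 < ... starting at it.

    Assumed rule: each next word is in the vocabulary, is exactly one
    character longer, and contains the previous word as a substring.
    A word with no such super-string has chain length 1.
    """
    words = set(vocabulary)
    by_len = {}
    for w in words:
        by_len.setdefault(len(w), []).append(w)

    @lru_cache(maxsize=None)
    def longest_from(w):
        best = 1
        # Only words one character longer are candidates for the next step.
        for s in by_len.get(len(w) + 1, []):
            if w in s:
                best = max(best, 1 + longest_from(s))
        return best

    return {w: longest_from(w) for w in words}
```

Comparing the distribution of these chain lengths between the VMs, other
languages and word-scrambled texts would be one way to check whether the
effect simply follows the entropy.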
