Re: VMs: VBScript for finding repeating strings
It took just 17 hours of number crunching. The full output is now here:
After all these years I'm still not familiar enough with the VMS to tell
whether this makes sense or not. Does the VMS indeed fall into three
different parts? Am I seeing the same segmentation as Stolfi does here:
There is also a list of Voynichese prefixes and suffixes that roll
spontaneously out of this approach. Again, I'm completely in the dark about
what it means, if it means anything :-) But it looks familiar.
> I am trying to understand the "A" images. Referring to Test 2B, I
> think the relatively open vertical bars indicate that few new
> qualifying strings were found in the interval (x-axis ca. 12000) and
> the relatively open horizontal bars show a lapse in the text in which
> there are few repetitions (y-axis ca. 48000).
I think you're mostly right. I say "mostly" because I myself don't yet know
what "right" is. But I think about it the same way as you.
If these are matching strings in a document:
Then the X coordinate is the starting point of the first string, and Y the
starting point of the second string.
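The X/Y convention described above can be sketched in a few lines. This is a Python illustration, not the VBScript used for the tests; the function name `repeat_points` and the fixed window length `n` are assumptions made for the example, since this message does not say how the script decides that a string qualifies.

```python
from collections import defaultdict

def repeat_points(text, n):
    """Return sorted (x, y) pairs, x < y, where text[x:x+n] == text[y:y+n].

    x is the start of the earlier occurrence and y the start of the later
    one, so every point lies on one side of the diagonal.
    """
    # Collect the start positions of every window of length n.
    starts = defaultdict(list)
    for i in range(len(text) - n + 1):
        starts[text[i:i + n]].append(i)

    # Pair up every two occurrences of the same window.
    points = []
    for positions in starts.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                points.append((positions[a], positions[b]))
    return sorted(points)

# Made-up input: "abc" occurs at offsets 0, 4 and 8.
points = repeat_points("abcXabcYabc", 3)
```

Plotting each pair as a dot gives the kind of picture discussed here: a stretch of text that repeats itself often fills in a dense triangle, and a stretch with few repeats leaves an open bar.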
> Different subject matter?
The dark triangles are parts with a strong internal coherence. Maybe same
subject matter (see for example test07 where you can clearly see three
modules of C++), but maybe also the same language (see for example test03
where you can see the difference between Dutch and English).
> The sparsely-filled open bars would form borders of
> triangles. Does the lower edge show first occurrences of qualifying
> strings? But do I see dots above gaps in the lower edge?
At the moment your guess is as good as mine. But it's something in this
> How would a "random" text for comparison be constructed? Line shuffling?
I have used random input in test05, and as is to be expected there is no
> In the tests, can strings be partially on one line and partially on
> the next line?
Yes they can. I first join all lines by replacing the Newline with a Space.
And just to be sure, I scan for double spaces and remove them before the
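A minimal Python sketch of that preprocessing, assuming that "removing" double spaces means collapsing each run of spaces to a single space (the name `normalize` is made up for the example, and the original script is VBScript):

```python
def normalize(raw):
    """Join all lines into one string and collapse runs of spaces."""
    # Replace every line break with a space so strings can span lines.
    joined = raw.replace("\r\n", " ").replace("\n", " ")
    # Repeat until no double space is left; a single pass would leave
    # a double space behind wherever three or more spaces occurred.
    while "  " in joined:
        joined = joined.replace("  ", " ")
    return joined

cleaned = normalize("abc \ndef\n\nghi")
```

The loop matters when a line ends with a space or an empty line sits between two lines, because each of those produces more than one space in a row after joining.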
I have one strange effect that I don't understand. If I scan a small part of
the VMS (test04) I get a recognizable pattern. Then if I remove all the
spaces (test08) I find far fewer matching strings. I wouldn't expect
spaces to be relevant, and my first guess would be that I find the same
matching strings, only with the spaces removed. But I get a really different
result. Is it a bug in my script or a bug in my reasoning?
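One possible contributor, offered only as a guess: if a string qualifies by having a minimum length counted in characters, then spaces count toward that length, and the same threshold becomes stricter once they are gone. The Python sketch below shows the effect on a made-up sample; the threshold of 8, the sample text and the name `count_repeated_windows` are assumptions for illustration and are not taken from the actual script.

```python
def count_repeated_windows(text, n):
    """Count distinct windows of length n that occur more than once."""
    seen = {}
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        seen[window] = seen.get(window, 0) + 1
    return sum(1 for count in seen.values() if count > 1)

# Made-up sample: the phrase "daiin ol" (8 characters) appears twice.
sample = "daiin ol xyz daiin ol"
stripped = sample.replace(" ", "")

with_spaces = count_repeated_windows(sample, 8)       # "daiin ol" repeats
without_spaces = count_repeated_windows(stripped, 8)  # "daiinol" is only 7 long
shorter_window = count_repeated_windows(stripped, 7)  # the repeat is back
```

With the spaces in place the repeated phrase just reaches 8 characters; without them it is 7 characters long and no window of 8 repeats any more. If test08 used the same minimum length as test04, lowering it roughly in proportion to the share of spaces in the text would be one way to check whether this explains the difference.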
I certainly hope I'm not wasting your time with a faulty or irrelevant