
Re: VMs: VBScript for finding repeating strings

It took just 17 hours of number crunching. The full output is now here:

After all these years I'm still not familiar enough with the VMS to tell
whether this makes sense or not. Does the VMS indeed fall into three
different parts? Am I seeing the same segmentation as Stolfi does here:

There is also a list of Voynichese prefixes and suffixes that roll
spontaneously out of this approach. Again, I'm completely in the dark about
what it means, if it means anything :-) But it looks familiar.

> I am trying to understand the "A" images. Referring to Test 2B, I
> think the relatively open vertical bars indicate that few new
> qualifying strings were found in the interval (x-axis ca. 12000) and
> the relatively open horizontal bars show a lapse in the text in which
> there are few repetitions (y-axis ca. 48000).

I think you're mostly right. I say "mostly" because I myself don't yet know
what "right" is. But I think about it the same way as you.

If these are matching strings in a document:


Then the X coordinate is the starting point of the first string, and Y the
starting point of the second string.
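
To make the coordinates concrete, here is a minimal sketch in Python (not my
actual VBScript). It assumes a fixed string length n and the hypothetical
name repeat_points; the real script may qualify strings differently. Every
pair of identical strings yields one dot (x, y) with x < y, so all dots fall
on one side of the diagonal.

```python
from collections import defaultdict

def repeat_points(text, n):
    """Return sorted (x, y) pairs, x < y, where text[x:x+n] == text[y:y+n]."""
    # Record every starting position of every substring of length n.
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    # Each pair of occurrences of the same substring becomes one dot.
    points = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                points.append((starts[a], starts[b]))
    return sorted(points)

# "abc" occurs at positions 0, 4 and 8, giving three dots.
print(repeat_points("abcXabcYabc", 3))  # [(0, 4), (0, 8), (4, 8)]
```

A stretch of text that repeats itself a lot produces many dots with both x
and y inside that stretch, which would show up as a dense triangle against
the diagonal.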

> Different subject matter?

The dark triangles are parts with a strong internal coherence. Maybe same
subject matter (see for example test07 where you can clearly see three
modules of C++), but maybe also the same language (see for example test03
where you can see the difference between Dutch and English).

> The sparsely-filled open bars would form borders of
> triangles. Does the lower edge show first occurrences of qualifying
> strings? But do I see dots above gaps in the lower edge?

At the moment your guess is as good as mine. But it's something in this

> How would a "random" text for comparison be constructed? Line shuffling?

I have used random input in test05, and as is to be expected there is no
apparent pattern.
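
For reference, two ways a comparison text could be built, sketched in
Python. This is only an illustration: it is not necessarily how the test05
input was made, and the names shuffle_lines and shuffle_chars are
hypothetical. Line shuffling keeps every line intact and only destroys the
order; character shuffling keeps only the letter frequencies.

```python
import random

def shuffle_lines(lines, seed=0):
    """Return the same lines in a random order (each line stays intact)."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled

def shuffle_chars(text, seed=0):
    """Return the same characters in a random order (frequencies preserved)."""
    rng = random.Random(seed)
    chars = list(text)
    rng.shuffle(chars)
    return "".join(chars)

lines = ["first line", "second line", "third line"]
print(shuffle_lines(lines))
print(shuffle_chars(" ".join(lines)))
```

Line shuffling would still find every repeat that lies within a single line,
so it tests only the large-scale structure; character shuffling removes
nearly all long repeats.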

> In the tests, can strings be partially on one line and partially on
> the next line?

Yes they can. I first join all lines by replacing the Newline with a Space.
And just to be sure, I scan for double spaces and remove them before the
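
The joining step can be sketched as follows, in Python rather than the
VBScript I actually use. The name join_lines is hypothetical, and I assume
here that a run of spaces is collapsed to a single space.

```python
import re

def join_lines(raw):
    """Join all lines into one string and collapse runs of spaces."""
    # Treat Windows and Unix line endings alike, then replace each with a space.
    joined = raw.replace("\r\n", "\n").replace("\n", " ")
    # A trailing space before a newline would leave a double space; collapse it.
    return re.sub(r" {2,}", " ", joined)

print(join_lines("daiin  chol\nshedy \nqokeedy"))  # daiin chol shedy qokeedy
```

After this step a matching string can start on one line and end on the next
without the line break getting in the way.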

I have one strange effect that I don't understand. If I scan a small part of
the VMS (test04) I get a recognizable pattern. Then if I remove all the
spaces (test08) I find far fewer matching strings. I wouldn't expect
spaces to be relevant, and my first guess would be that I find the same
matching strings, only with the spaces removed. But I get a really different
result. Or is it a bug in my script or a bug in my reasoning?
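
One possible explanation, which I have not verified against the script: if
a string only qualifies when it reaches some fixed minimum length, then a
match that just reaches that length with its spaces included falls below it
once the spaces are gone. The Python sketch below illustrates the effect on
a toy text; the fixed length n and the name repeated_ngrams are assumptions
made for the illustration.

```python
from collections import Counter

def repeated_ngrams(text, n):
    """Return the set of length-n substrings that occur more than once."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {s for s, c in counts.items() if c > 1}

with_spaces = "ab cd ab cd"
without_spaces = with_spaces.replace(" ", "")  # "abcdabcd"

print(repeated_ngrams(with_spaces, 5))     # {'ab cd'}
print(repeated_ngrams(without_spaces, 5))  # set()
print(repeated_ngrams(without_spaces, 4))  # {'abcd'}
```

With the same length of 5 the repeat disappears after the spaces are
removed, and it only comes back when the length is lowered to 4. If this is
what happens, test08 would need a shorter minimum length to be comparable
with test04.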

I certainly hope I'm not wasting your time with a faulty or irrelevant
result :-)

Greetings, Petr
