
VMs: Petr's repeated strings plots

Petr asks

  > I have one strange effect that I don't understand. If I scan a
  > small part of the VMS (test04) I get a recognizable pattern. Then
  > if I remove all the spaces (test08) I find a whole lot less
  > matching strings. I wouldn't expect spaces to be relevant, and my
  > first guess would be that I find the same matching strings, only
  > with the spaces removed. But I get a really different result. Or
  > is it a bug in my script or a bug in my reasoning?

If you are looking for strings of the same length (say 12 characters)
then in the second case you probably get more varied strings (i.e.
more bits of information), because spaces are fairly predictable from 
the other letters.  More varied strings mean that the chances of 
finding a match are smaller.

Said another way, if before you had a match between "foo bar quux" and
"foo bar quux", after deleting the spaces the same 12-character window
reaches two letters farther into the text, so you may get "foobarquuxaa"
and "foobarquuxbb", which do not match.  The reverse phenomenon
(non-matches becoming matches) is much less likely, again because
the spaces are quite predictable.
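
To make the argument concrete, here is a minimal sketch in Python. It
assumes the script compares fixed-length windows of 12 characters; the
function names and the sample strings are made up for illustration and
are not taken from Petr's script.

```python
def windows(text, n=12):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_windows(a, b, n=12):
    """Return the n-character substrings that occur in both a and b."""
    return windows(a, n) & windows(b, n)


# Two passages that agree on "foo bar quux" and then diverge.
a = "foo bar quuxaa"
b = "foo bar quuxbb"

# With spaces, the 12-character window "foo bar quux" is common to both.
with_spaces = shared_windows(a, b)

# Without spaces, the only 12-character windows are "foobarquuxaa" and
# "foobarquuxbb", which differ in their last two letters.
without_spaces = shared_windows(a.replace(" ", ""), b.replace(" ", ""))

print(with_spaces)
print(without_spaces)
```

With the window length held at 12, the match found with spaces
disappears once they are deleted, because each window then covers more
letters of the underlying text.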

As for interpreting the plots: each dark triangle along the diagonal
corresponds to a set of pages with distinctive word (or word-pair)
frequencies. Rectangles off the diagonal that are just as dark as the
triangles may reveal sections that have been split by page shuffling.

Rectangles with distinctive shades of gray are very interesting, as they
indicate two sections with similar but not identical word frequencies.

The labels are known to be essentially unique, so each batch of labels
in the input file should show up as a small white triangle along the
diagonal; and two batches of labels will generate a white rectangle off
the diagonal.  Matches between labels and other sections (which show 
up as dots in the vertical or horizontal band corresponding to the 
label batch) are quite interesting too.
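
The way dark and white regions arise can be sketched as follows. This
assumes the plot is a dot plot, with a dot at (i, j) whenever the string
starting at position i equals the string starting at position j; that is
a guess about how the plots are made, and the function name, the window
length of 4, and the toy text are all hypothetical.

```python
from collections import defaultdict


def match_dots(text, n=4):
    """Return sorted (i, j) pairs, i < j, whose n-character windows are equal."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    dots = []
    for starts in positions.values():
        # Every pair of positions sharing the same window gives one dot.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                dots.append((starts[a], starts[b]))
    return sorted(dots)


repetitive = "abcdabcdabcd"   # stands in for a section with repeated strings
unique = "efghijklmnop"       # stands in for a batch of unique labels
dots = match_dots(repetitive + unique)

print(dots)
```

In this toy input every dot falls inside the first 12 positions, giving
a dark triangle there, while the part of the plot belonging to the
unique tail stays empty, as a batch of labels would.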

All the best,

Jorge Stolfi