[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

VMs: Re: Numbercrunching "word" tuples



When I opened the messages on this subject, I
thought that they would be about the recent
suggestion of looking for repeated segments
of text, and where they occur.

I have just done this for the Easter Island
tablets -- rather, for a numerical transliteration
of them where each sign is represented by a
three-digit code.

The corpus is about 12,000 signs long (the VMS
is about 240,000 characters long). At first,
I looked for repeating sequences of 10 signs.
But then I realized that it was silly, since
if, say, on tablet X you had the same signs
repeated as on tablet Y, but with one missing
or in a different position, this close match
would be overlooked.

So eventually, I settled for sequences of 5
signs, _anagrammed_, that is, the same 5 signs
in any order; and instead of examining the
texts in chunks of five signs, I had the
algorithm "crawl" through it sign by sign.
For instance, it compares signs 1 to 5 of tablet
A with signs 1 to 5 of tablet B, then with signs
2 to 6, 3 to 7 and so on.  

I feared it would be terribly slow. At first it
was, but I found that that was the effect of
memory fragmentation. I rewrote parts of the
code and it now carries out some 80,000
comparisons per second, so that comparing
two typical tablets, each 1000 signs long,
takes only 1000*1000/80000 = about 12 seconds
(and that on an "old" PC running at 200MHz).

I will shortly be writing an addendum to
the algorithm, where it merges the overlapping
similar segments which it has identified.

Taking the VMS to be 240k long, this algorithm,
on this machine, would take 240k^2/2/80k = 100hrs
to run. And, I sumise, only 10hrs on a modern
machine. The only worry might be the amount of
output generated.


Here is a sample. The code in
the leftmost column give the
tablet (single uppercase letter)
it side (r for recto, v for verso)
its line (e.g. 02)
the position of the first sign
of the group on the line (e.g. 008,
eighth sign, counting from the
beginning of the line)

Gr02.008  073.006-200-073.006-
Kr02.020  073.006-200-073.006-

Gr02.009  006-200-073.006-222-
Kr02.021  006-200-073.006-222-

Gr02.010  200-073.006-222-073V.
Kr02.022  200-073.006-222-073V.

Gr02.011  073.006-222-073V.006-
Kr02.023  073.006-222-073V.006-

The algorithm also keeps a record
of groups with only 4 matching
signs, e.g.


Aa03.002  060a.003a-461a-060a.003a-
Ab02.031  060a.003a-600-060a-003a-


To give you an example of the size
of these output files, the file listing
sequences of five signs all matching
is only 84,812 bytes, but the one
listing sequences with 4 matching signs
out of 5 is 963,451 bytes (that is why
I did not record matches of only 3
signs and less -- I only kept a count).

Conclusion?  There is really nothing
preventing us from doing it for the 
VMS. Even on my aging machine, that 
won't take more that 5 days.