VMs: Words with localized occurrence
First, some errata:
> [Stolfi:] seen by Gabriel, Merk Perakh, and others. In spectral
> analysis, in particular, they enhance the low-frequency components
> (long waves), at the expense of the low-frequency ones.
Read: *Mark* Perakh, and "at the expense of *high*-frequency ones."
> [Bruce:] Another explanation occurs to me for words which recurr
> in one part of a text and not in another: if you are attempting to
> create "random" text, it is not unusual to find yourself repeating
> the same word until you realize that you have been using it a lot,
> and start avoiding it.
Perhaps... but I doubt whether this effect can be strong enough to
register, or that the author would have bothered to check it.
Some words like EVA "daiin" occur extremely often in the VMS, and most
of them have relatively uniform distributions through all sections. So
if the author was worried about excessive repetition of, say, why
didn't he care about those words too?
I have just spent a couple of hours looking for words that (like
"brother" in WotW) occur confined to a much smaller range of positions
than would be expected by chance. There are hundreds of such words in
the VMS, and, predictably, most of them disappear if the words are
scrambled. Moreover, when I repeated the analysis on the WotW novel
(truncated to the same number of tokens), I got almost identical
results -- many "strongly localized" words in the original text, very
few in the scrambled text.
I hope to have time to set up a page with this analysis. Meanwhile,
here are a few data points. Both files have about 47,000 tokens,
including line break and paragraph break markers, invalid readings,
etc. Here are the words that occur at least 4 times (excluding single
letters in key sequences etc.), with all occurrences contained in a
4000-token range in the unscrambled files:
VMS
COUNT FIRST  LAST  SPAN WORD
----- ----- ----- ----- ---------
    4  3805  7398  3593 okoy
    4  4614  6566  1952 chcth
    4  4917  8138  3221 chekedy
    4  9566 11757  2191 chokchol
    4 14878 16773  1895 oteotey
    4 26581 29242  2661 otodar
    4 29013 32604  3591 keeos
    5 36499 38764  2265 lkal
    4 37853 41296  3443 lkeeedy
WotW
COUNT FIRST  LAST  SPAN WORD
----- ----- ----- ----- ---------
    5  1376  2077   701 telescope
    5  2677  3580   903 meteorite
    4  5245  8510  3265 humming
    5  7681  8558   877 mirror
    4  9617 10198   581 wine
    8 12698 16352  3654 college
    5 13037 15998  2961 landlord
    4 17203 17557   354 wiped
   14 17796 20162  2366 artilleryman
    6 19172 19395   223 lieutenant
    7 19247 19347   100 sir
    4 22003 24657  2654 path
    4 22690 26450  3760 power
    4 24740 25718   978 alarmed
    4 25825 26121   296 south~western
    4 26938 27156   218 strand
    5 28998 31388  2390 george's
    4 29471 31238  1767 ditton
    6 29733 33680  3947 slender
    4 33070 33252   182 lady
    5 33087 36443  3356 revolver
    4 34464 36285  1821 wretched
    4 36657 37792  1135 northern
    9 37543 40031  2488 sea
   10 38019 40188  2169 coast
    4 38176 39990  1814 ships
   10 38505 40241  1736 steamer
    7 38558 40275  1717 steamboat
    5 38615 40311  1696 captain
    6 38670 40099  1429 seaward
    4 38690 39895  1205 funnels
    4 38732 40150  1418 ironclads
    4 40637 43447  2810 curate's
   11 42217 43686  1469 kitchen
    4 42407 43506  1099 plaster
    4 42953 43689   736 scullery
In the scrambled versions of both files, there are *no* words that
meet these conditions. That is, every word that occurs 4 or more times
spans more than 4000 tokens.
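For anyone who wants to repeat this kind of search, here is a minimal sketch in Python. It is not the script used for the tables above. It assumes the text is already split into a list of tokens, that positions are 0-based token indices, that SPAN means LAST minus FIRST, and that "contained in a 4000-token range" means a span of at most 4000. The names `localized_words` and `scrambled` are made up for illustration.

```python
import random
from collections import defaultdict

def localized_words(tokens, min_count=4, max_span=4000):
    """Return (count, first, last, span, word) for every word that occurs
    at least min_count times with all occurrences within max_span tokens."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)          # indices are appended in increasing order
    hits = []
    for word, pos in positions.items():
        span = pos[-1] - pos[0]           # assumed definition: LAST - FIRST
        if len(pos) >= min_count and span <= max_span:
            hits.append((len(pos), pos[0], pos[-1], span, word))
    return sorted(hits, key=lambda h: h[1])   # order by first occurrence

def scrambled(tokens, seed=0):
    """Return a randomly permuted copy of the token list (the control text)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Running `localized_words(tokens)` and `localized_words(scrambled(tokens))` on the same token list gives the original-versus-scrambled comparison described above.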
In the same vein, these words occur exactly twice, at most 20
tokens apart:
VMS:
FIRST  LAST  SPAN WORD
----- ----- ----- ---------
  644   650     6 damo
 6713  6731    18 olchdaiin
16965 16969     4 cheteey
27756 27757     1 qoekol
36106 36116    10 otaraiin
36462 36465     3 chtl
WotW:
FIRST  LAST  SPAN WORD
----- ----- ----- ---------
  789   791     2 generation
 5156  5159     3 a~screwin'
 6428  6430     2 joint
 6531  6538     7 ugly
 6532  6539     7 brutes
 6914  6915     1 flutter
 8207  8214     7 novelty
11097 11104     7 fringe
15267 15268     1 aloo
17366 17370     4 shield
18391 18403    12 streamed
18414 18422     8 pillars
23663 23665     2 rows
29884 29897    13 uppermost
31994 31997     3 losing
33642 33655    13 deepened
34090 34094     4 girls
35385 35399    14 ellen
35568 35571     3 garrick
36993 36997     4 stampede
37729 37738     9 midland
Note that these include the successive repeats "qoekol qoekol",
"flutter flutter" and "aloo aloo".
As before, the scrambled files had no hits --- i.e. all words with two
occurrences were spread wider than 20 tokens.
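The "exactly twice, close together" test can be sketched the same way. Again this is only an illustration under assumptions: tokens come as a Python list, positions are 0-based indices, and SPAN is LAST minus FIRST, so an immediate repeat has span 1. The name `close_pairs` is hypothetical.

```python
from collections import defaultdict

def close_pairs(tokens, max_gap=20):
    """Return (first, last, span, word) for words that occur exactly twice,
    with the two occurrences at most max_gap tokens apart."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return sorted(
        (pos[0], pos[1], pos[1] - pos[0], word)
        for word, pos in positions.items()
        if len(pos) == 2 and pos[1] - pos[0] <= max_gap
    )
```

Words with three or more occurrences are skipped even if two of them are adjacent, which matches the "exactly twice" condition.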
In conclusion, both books contain many words that are confined
to specific sections, far more than expected by chance.
Of course this does not prove anything, but it is yet another
constraint on proposed theories.
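To get a rough feel for "expected by chance", one can use a standard continuous approximation: if the k occurrences of a word were placed independently and uniformly over a text of N tokens, the probability that all of them fall within a span of s tokens is about k*x^(k-1) - (k-1)*x^k, where x = s/N. This is a back-of-the-envelope model and an assumption added here, not the calculation behind the tables. It ignores the discreteness of token positions, and `prob_confined` is a hypothetical name.

```python
def prob_confined(k, span, total):
    """Approximate probability that k independent, uniformly placed
    occurrences all lie within `span` tokens of a `total`-token text.
    Uses the distribution of the range of k uniform points on [0, 1]."""
    x = span / total
    return k * x ** (k - 1) - (k - 1) * x ** k

# The two tests used above, for a text of about 47,000 tokens:
p_four_in_4000 = prob_confined(4, 4000, 47000)   # 4 occurrences within 4000 tokens
p_two_in_20 = prob_confined(2, 20, 47000)        # 2 occurrences within 20 tokens
```

Under this model a given 4-occurrence word has a chance of roughly 0.002 of passing the first test, and a given 2-occurrence word a chance below 0.001 of passing the second. Both figures are consistent with the scrambled files showing no hits.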
Surprisingly, the WotW has more "lumpy" words than the VMS. This may
be due to the fact that different sections are partly mixed in the
VMS. I need to find a metric of "lumpiness" that degrades more
gracefully with such block-scrambling.
All the best,
--stolfi