
VMs: Words with localized occurrence



First, some errata:

  > [Stolfi:] seen by Gabriel, Merk Perakh, and others. In spectral
  > analysis, in particular, they enhance the low-frequency components
  > (long waves), at the expense of the low-frequency ones.

Read: *Mark* Perakh, and "at the expense of *high*-frequency ones."

  > [Bruce:] Another explanation occurs to me for words which recurr
  > in one part of a text and not in another: if you are attempting to
  > create "random" text, it is not unusual to find yourself repeating
  > the same word until you realize that you have been using it a lot,
  > and start avoiding it.

Perhaps... but I doubt whether this effect can be strong enough to
register, or that the author would have bothered to check it.

Some words like EVA "daiin" occur extremely often in the VMS, and most
of them have relatively uniform distributions through all sections. So
if the author was worried about excessive repetition of some words, why
didn't he care about those words too?

I have just spent a couple of hours looking for words that (like
"brother" in WotW) occur confined to a much smaller range of positions
than would be expected by chance. There are hundreds of such words in
the VMS, and, predictably, most of them disappear if the words are
scrambled. Moreover, when I repeated the analysis on the WotW novel
(truncated to the same number of tokens), I got almost identical
results -- many "strongly localized" words in the original text, very
few in the scrambled text.

I hope to have time to set up a page with this analysis. Meanwhile,
here are a few data points. Both files have about 47,000 tokens,
including line break and paragraph break markers, invalid readings,
etc. Here are the words that occur at least 4 times (excluding single
letters in key sequences etc.), with all occurrences contained in a
4000-token range in the unscrambled files:

VMS
   
  COUNT   FIRST    LAST    SPAN WORD
  -----   -----   -----   ----- ---------
      4    3805    7398    3593 okoy
      4    4614    6566    1952 chcth
      4    4917    8138    3221 chekedy
      4    9566   11757    2191 chokchol
      4   14878   16773    1895 oteotey
      4   26581   29242    2661 otodar
      4   29013   32604    3591 keeos
      5   36499   38764    2265 lkal
      4   37853   41296    3443 lkeeedy

WotW

  COUNT   FIRST    LAST    SPAN WORD
  -----   -----   -----   ----- ---------
      5    1376    2077     701 telescope
      5    2677    3580     903 meteorite
      4    5245    8510    3265 humming
      5    7681    8558     877 mirror
      4    9617   10198     581 wine
      8   12698   16352    3654 college
      5   13037   15998    2961 landlord
      4   17203   17557     354 wiped
     14   17796   20162    2366 artilleryman
      6   19172   19395     223 lieutenant
      7   19247   19347     100 sir
      4   22003   24657    2654 path
      4   22690   26450    3760 power
      4   24740   25718     978 alarmed
      4   25825   26121     296 south~western
      4   26938   27156     218 strand
      5   28998   31388    2390 george's
      4   29471   31238    1767 ditton
      6   29733   33680    3947 slender
      4   33070   33252     182 lady
      5   33087   36443    3356 revolver
      4   34464   36285    1821 wretched
      4   36657   37792    1135 northern
      9   37543   40031    2488 sea
     10   38019   40188    2169 coast
      4   38176   39990    1814 ships
     10   38505   40241    1736 steamer
      7   38558   40275    1717 steamboat
      5   38615   40311    1696 captain
      6   38670   40099    1429 seaward
      4   38690   39895    1205 funnels
      4   38732   40150    1418 ironclads
      4   40637   43447    2810 curate's
     11   42217   43686    1469 kitchen
      4   42407   43506    1099 plaster
      4   42953   43689     736 scullery

In the scrambled versions of both files, there are *no* words that
meet these conditions. That is, every word that occurs 4 or more times
spans more than 4000 tokens.
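
For anyone who wants to try this on their own transcription, the test
above can be sketched roughly as follows. This is not the script I
actually ran, only a minimal illustration: the names localized_words
and scrambled are made up for this note, positions are counted from 0
(the tables may use a different origin), and "scrambling" is assumed to
mean a uniform random permutation of the tokens.

```python
import random
from collections import defaultdict

def localized_words(tokens, min_count=4, max_span=4000):
    """Return (count, first, last, span, word) rows for every word that
    occurs at least min_count times with all occurrences inside a range
    of at most max_span token positions."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)      # positions are collected in increasing order
    rows = []
    for word, pos in positions.items():
        span = pos[-1] - pos[0]       # SPAN = LAST - FIRST, as in the tables
        if len(pos) >= min_count and span <= max_span:
            rows.append((len(pos), pos[0], pos[-1], span, word))
    rows.sort(key=lambda r: r[1])     # order by first occurrence
    return rows

def scrambled(tokens, seed=0):
    """Return a randomly permuted copy of the token list (assumed
    meaning of 'scrambled'; the seed is only for repeatability)."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

One would then compare localized_words(tokens) against
localized_words(scrambled(tokens)) for the same token list.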

In the same vein, these words occur exactly twice, at most 20
tokens apart:

VMS:

    FIRST    LAST    SPAN WORD
    -----   -----   ----- ---------
      644     650       6 damo
     6713    6731      18 olchdaiin
    16965   16969       4 cheteey
    27756   27757       1 qoekol
    36106   36116      10 otaraiin
    36462   36465       3 chtl

WotW:

    FIRST    LAST    SPAN WORD
    -----   -----   ----- ---------
      789     791       2 generation
     5156    5159       3 a~screwin'
     6428    6430       2 joint
     6531    6538       7 ugly
     6532    6539       7 brutes
     6914    6915       1 flutter
     8207    8214       7 novelty
    11097   11104       7 fringe
    15267   15268       1 aloo
    17366   17370       4 shield
    18391   18403      12 streamed
    18414   18422       8 pillars
    23663   23665       2 rows
    29884   29897      13 uppermost
    31994   31997       3 losing
    33642   33655      13 deepened
    34090   34094       4 girls
    35385   35399      14 ellen
    35568   35571       3 garrick
    36993   36997       4 stampede
    37729   37738       9 midland

Note that these include the successive repeats "qoekol qoekol", 
"flutter flutter" and "aloo aloo". 

As before, the scrambled files had no hits --- i.e. all words with two
occurrences were spread wider than 20 tokens.
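
The two-occurrence test can be sketched the same way. Again this is
only an illustration under the same assumptions (0-based positions,
one token per list element); close_pairs is a name invented here.

```python
from collections import defaultdict

def close_pairs(tokens, max_gap=20):
    """Return (first, last, span, word) rows for words that occur
    exactly twice, with the two occurrences at most max_gap tokens apart."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    rows = [
        (pos[0], pos[1], pos[1] - pos[0], word)
        for word, pos in positions.items()
        if len(pos) == 2 and pos[1] - pos[0] <= max_gap
    ]
    rows.sort()                       # order by first occurrence
    return rows
```

An immediate repeat such as "aloo aloo" shows up here with span 1.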

In conclusion, both books contain many words that are confined
to specific sections, far more than expected by chance.
Of course this does not prove anything, but is yet another
constraint on proposed theories. 
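
To give a rough feeling for "expected by chance": if the k occurrences
of a word were placed independently and uniformly along the text (an
idealized assumption, and a continuous approximation of token
positions), the chance that they all fall within a fraction r of the
text is k*r^(k-1) - (k-1)*r^k. A small sketch, with a made-up function
name:

```python
def prob_all_within(k, r):
    """Probability that k independent uniform points on [0, 1] have
    range (max - min) at most r, for 0 <= r <= 1 and k >= 2.
    This is the standard distribution of the range of a uniform sample."""
    return k * r ** (k - 1) - (k - 1) * r ** k

# A word with 4 occurrences confined to 4000 of about 47000 tokens:
p4 = prob_all_within(4, 4000 / 47000)
# A word with 2 occurrences at most 20 tokens apart:
p2 = prob_all_within(2, 20 / 47000)
```

Under these assumptions p4 comes out at a fraction of one percent per
word, and p2 smaller still, which agrees with the scrambled files
showing no hits.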

Surprisingly, the WotW has more "lumpy" words than the VMS. This may
be due to the fact that different sections are partly mixed in the
VMS.  I need to find a metric of "lumpiness" that degrades more
gracefully with such block-scrambling.

All the best,

--stolfi

