
Re: VMs: Reducing the VMS to a stream of grouped glyphs...?



Hi everyone,

Here's a list of Perl substitutions that I ran (in order) over one transcription of the VMS text to test out some of the ideas I've been talking about recently (with results below). The order matters: each match is replaced by "_", so a glyph claimed by an earlier group can't be counted again by a later one. Please feel free to experiment with (and develop) these as you like; it's not exactly rocket science. :-)

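        # Each s///g acts on $_ (the transcription text being processed)
        # and returns the number of substitutions it made.
        $dy +=  s/dy/_/g;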
        $ol +=  s/ol/_/g;
        $or +=  s/or/_/g;
        $al +=  s/al/_/g;
        $ar +=  s/ar/_/g;
        $am +=  s/am/_/g;
        $om +=  s/om/_/g;
        $ee +=  s/ee/_/g;
        $cc +=  s/cc/_/g;
        $an +=  s/an/_/g;
        $ain += s/ain/_/g;
        $aiin +=        s/aiin/_/g;
        $aiiin +=       s/aiiin/_/g;
        $air += s/air/_/g;
        $aiir +=        s/aiir/_/g;
        $aiiir +=       s/aiiir/_/g;
        $on +=  s/on/_/g;
        $oin += s/oin/_/g;
        $oiin +=        s/oiin/_/g;
        $oiiin +=       s/oiiin/_/g;
        $oir += s/oir/_/g;
        $oiir +=        s/oiir/_/g;
        $oiiir +=       s/oiiir/_/g;
        $aiim +=        s/aiim/_/g;
        $qo +=  s/qo/_/g;
        $ofe += s/ofe/_/g;
        $of +=  s/of/_/g;
        $fe +=  s/fe/_/g;
        $oke += s/oke/_/g;
        $ok +=  s/ok/_/g;
        $ke +=  s/ke/_/g;
        $ope += s/ope/_/g;
        $op +=  s/op/_/g;
        $pe +=  s/pe/_/g;
        $ote += s/ote/_/g;
        $ot +=  s/ot/_/g;
        $te +=  s/te/_/g;
        $ocfhe +=       s/ocfhe/_/g;
        $ocfh +=        s/ocfh/_/g;
        $cfhe +=        s/cfhe/_/g;
        $cfh += s/cfh/_/g;
        $ockhe +=       s/ockhe/_/g;
        $ockh +=        s/ockh/_/g;
        $ckhe +=        s/ckhe/_/g;
        $ckh += s/ckh/_/g;
        $ocphe +=       s/ocphe/_/g;
        $ocph +=        s/ocph/_/g;
        $cphe +=        s/cphe/_/g;
        $cph += s/cph/_/g;
        $octhe +=       s/octhe/_/g;
        $octh +=        s/octh/_/g;
        $cthe +=        s/cthe/_/g;
        $cth += s/cth/_/g;
        $she += s/she/_/g;
        $sh +=  s/sh/_/g;
        $che += s/che/_/g;
        $ch +=  s/ch/_/g;
        $od +=  s/od/_/g;
        $os +=  s/os/_/g;
        $oe +=  s/oe/_/g;
        $fy +=  s/fy/_/g;
        $ky +=  s/ky/_/g;
        $py +=  s/py/_/g;
        $ty +=  s/ty/_/g;
        $yf +=  s/yf/_/g;
        $yk +=  s/yk/_/g;
        $yp +=  s/yp/_/g;
        $yt +=  s/yt/_/g;

        # Count the single glyphs left over; replacing each with itself
        # leaves the text unchanged.
        $o +=   s/o/o/g;
        $s +=   s/s/s/g;
        $d +=   s/d/d/g;
        $y +=   s/y/y/g;
        $f +=   s/f/f/g;
        $k +=   s/k/k/g;
        $p +=   s/p/p/g;
        $t +=   s/t/t/g;
        $a +=   s/a/a/g;
        $i +=   s/i/i/g;
        $c +=   s/c/c/g;
        $e +=   s/e/e/g;
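
The same ordered pass can also be driven from a list instead of one statement per group. What follows is only a minimal sketch: the group list is a short subset of the one above (in the same relative order), and the sample text is made up for illustration, not taken from any transcription.

```perl
use strict;
use warnings;

# Short subset of the groups above, kept in the same relative order.
my @groups = qw(dy ol or ee aiin qo oke ok che ch);

# Made-up sample text (hypothetical, not real transcription data).
my $text = "qokeedy chedy okaiin chol";

my %count;
for my $g (@groups) {
    # s///g returns the number of substitutions made (false if none).
    # Each match becomes "_" so later groups cannot claim it again.
    $count{$g} += $text =~ s/\Q$g\E/_/g;
}

# Single glyphs that no group claimed, e.g. a leftover <k>.
my $k = () = $text =~ /k/g;

printf "%-6s %d\n", $_, $count{$_} for @groups;
print "k      $k\n";
print "stream $text\n";
```

Note that <oke> finds nothing in "qokeedy" here, because <ee> and <qo> come earlier in the list and have already claimed their glyphs; only a lone <k> is left. Reordering @groups changes the counts, which is the whole point of running the substitutions in a fixed order.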

Here are the results this returned (remainder not counted), slightly reformatted:-

$dy = 6894
$ol = 5571      $od = 1006      $os = 436       $oe = 121
$al = 3153
$am = 811
$om = 166
$ee = 4621
$cc = 0
$an = 116       $ain = 1333     $aiin = 3924    $aiiin = 73
$ar = 3364      $air = 585      $aiir = 111     $aiiir = 1
$on = 7         $oin = 10       $oiin = 146     $oiiin = 30
$or = 2655      $oir = 7        $oiir = 14      $oiiir = 0
$aiim = 13
$qo = 4997
$ofe = 0        $of = 114       $fe = 0
$oke = 577      $ok = 2407      $ke = 1201
$ope = 2        $op = 429       $pe = 3
$ote = 547      $ot = 2215      $te = 491
$ocfhe = 4      $ocfh = 13      $cfhe = 14      $cfh = 49
$ockhe = 39     $ockh = 101     $ckhe = 195     $ckh = 590
$ocphe = 8      $ocph = 21      $cphe = 55      $cph = 137
$octhe = 26     $octh = 99      $cthe = 162     $cth = 676
$she = 2099     $sh = 2408
$che = 4234     $ch = 6707
$fy = 20        $ky = 540       $py = 44        $ty = 264
$yf = 19        $yk = 526       $yp = 85        $yt = 432

$f = 251        $k = 4746       $p = 870        $t = 2018

$o = 3284
$s = 2178
$d = 5193
$y = 8862
$a = 1058
$i = 1010
$c = 99
$e = 752

While this is only one possible decomposition of the text into groups (or tokens), it's important to note that this is an entirely *different* exercise from looking at (say) the raw frequency of <ch>. For example, here there are 4234 <che>'s, leaving 6707 <ch>'s in the stream - whereas a raw count of <ch> would give the total of both (4234 + 6707 = 10941).
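
The difference between a raw count and a grouped count can be seen on a tiny example. This is just a sketch on a made-up string (hypothetical, not real transcription data), with only the <che> and <ch> steps of the pass:

```perl
use strict;
use warnings;

# Made-up sample text (hypothetical).
my $text = "chedy chol shedy cheol";

# Raw count: every occurrence of "ch", whatever follows it.
my $raw_ch = () = $text =~ /ch/g;

# Grouped count: take <che> out first, then count the <ch> that is left.
my $stream = $text;
my $che = ($stream =~ s/che/_/g) || 0;
my $ch  = ($stream =~ s/ch/_/g)  || 0;

print "raw ch = $raw_ch, che = $che, remaining ch = $ch\n";
```

Here the raw <ch> count is the sum of the two grouped counts, while the grouped pass keeps <che> and bare <ch> apart.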

Cheers, .....Nick Pelling.....
