Re: VMs: Reducing the VMS to a stream of grouped glyphs...?
Hi everyone,
Here's a list of Perl regular expressions that I ran (in order) over one
transcription of the VMS text to test out some of the ideas I've been
talking about recently (with results below). Please feel free to experiment
with (and develop) these as you like; it's not exactly rocket science. :-)
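# The transcription text is in $_; each s///g returns the number of
# substitutions it made. Order matters: a matched group becomes "_" and
# can no longer be matched by any later pattern.
$dy += s/dy/_/g;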
$ol += s/ol/_/g;
$or += s/or/_/g;
$al += s/al/_/g;
$ar += s/ar/_/g;
$am += s/am/_/g;
$om += s/om/_/g;
$ee += s/ee/_/g;
$cc += s/cc/_/g;
$an += s/an/_/g;
$ain += s/ain/_/g;
$aiin += s/aiin/_/g;
$aiiin += s/aiiin/_/g;
$air += s/air/_/g;
$aiir += s/aiir/_/g;
$aiiir += s/aiiir/_/g;
$on += s/on/_/g;
$oin += s/oin/_/g;
$oiin += s/oiin/_/g;
$oiiin += s/oiiin/_/g;
$oir += s/oir/_/g;
$oiir += s/oiir/_/g;
$oiiir += s/oiiir/_/g;
$aiim += s/aiim/_/g;
$qo += s/qo/_/g;
$ofe += s/ofe/_/g;
$of += s/of/_/g;
$fe += s/fe/_/g;
$oke += s/oke/_/g;
$ok += s/ok/_/g;
$ke += s/ke/_/g;
$ope += s/ope/_/g;
$op += s/op/_/g;
$pe += s/pe/_/g;
$ote += s/ote/_/g;
$ot += s/ot/_/g;
$te += s/te/_/g;
$ocfhe += s/ocfhe/_/g;
$ocfh += s/ocfh/_/g;
$cfhe += s/cfhe/_/g;
$cfh += s/cfh/_/g;
$ockhe += s/ockhe/_/g;
$ockh += s/ockh/_/g;
$ckhe += s/ckhe/_/g;
$ckh += s/ckh/_/g;
$ocphe += s/ocphe/_/g;
$ocph += s/ocph/_/g;
$cphe += s/cphe/_/g;
$cph += s/cph/_/g;
$octhe += s/octhe/_/g;
$octh += s/octh/_/g;
$cthe += s/cthe/_/g;
$cth += s/cth/_/g;
$she += s/she/_/g;
$sh += s/sh/_/g;
$che += s/che/_/g;
$ch += s/ch/_/g;
$od += s/od/_/g;
$os += s/os/_/g;
$oe += s/oe/_/g;
$fy += s/fy/_/g;
$ky += s/ky/_/g;
$py += s/py/_/g;
$ty += s/ty/_/g;
$yf += s/yf/_/g;
$yk += s/yk/_/g;
$yp += s/yp/_/g;
$yt += s/yt/_/g;
# Leftover single glyphs: each is replaced by itself, so these lines only count.
$o += s/o/o/g;
$s += s/s/s/g;
$d += s/d/d/g;
$y += s/y/y/g;
$f += s/f/f/g;
$k += s/k/k/g;
$p += s/p/p/g;
$t += s/t/t/g;
$a += s/a/a/g;
$i += s/i/i/g;
$c += s/c/c/g;
$e += s/e/e/g;
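For anyone wanting to run this kind of pass as one script, the same idea can be driven from an ordered list instead of one variable per group. This is only a minimal sketch: the helper name `count_groups`, the shortened group list and the sample line are all made up for illustration, and the full list above would need to be pasted in, in the same order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered list of groups (abbreviated here for illustration only).
# Order matters: each match becomes "_" and is hidden from later patterns.
my @groups = qw(dy ol or al ar ee aiin qo che ch);

# Hypothetical helper: runs the substitutions in order and returns
# a hash ref of counts plus the text that is left afterwards.
sub count_groups {
    my ($text, @patterns) = @_;
    my %count;
    for my $g (@patterns) {
        # s///g returns the number of substitutions, or "" if there were none.
        my $n = ($text =~ s/\Q$g\E/_/g) || 0;
        $count{$g} = $n;
    }
    return (\%count, $text);
}

# Made-up sample line, not real transcription data.
my $sample = "qokeedy chol chedy daiin";
my ($count, $residue) = count_groups($sample, @groups);

printf "\$%-5s = %d\n", $_, $count->{$_} for @groups;
print "residue: $residue\n";
```

On the made-up sample line the residue comes out as `_k__ __ __ d_`, i.e. only the single glyphs `k` and `d` are left to be counted by the final single-letter lines.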
Here are the results this returned (remainder not counted), slightly
reformatted:-
$dy = 6894
$ol = 5571 $od = 1006 $os = 436 $oe = 121
$al = 3153
$am = 811
$om = 166
$ee = 4621
$cc = 0
$an = 116 $ain = 1333 $aiin = 3924 $aiiin = 73
$ar = 3364 $air = 585 $aiir = 111 $aiiir = 1
$on = 7 $oin = 10 $oiin = 146 $oiiin = 30
$or = 2655 $oir = 7 $oiir = 14 $oiiir = 0
$aiim = 13
$qo = 4997
$ofe = 0 $of = 114 $fe = 0
$oke = 577 $ok = 2407 $ke = 1201
$ope = 2 $op = 429 $pe = 3
$ote = 547 $ot = 2215 $te = 491
$ocfhe = 4 $ocfh = 13 $cfhe = 14 $cfh = 49
$ockhe = 39 $ockh = 101 $ckhe = 195 $ckh = 590
$ocphe = 8 $ocph = 21 $cphe = 55 $cph = 137
$octhe = 26 $octh = 99 $cthe = 162 $cth = 676
$she = 2099 $sh = 2408
$che = 4234 $ch = 6707
$fy = 20 $ky = 540 $py = 44 $ty = 264
$yf = 19 $yk = 526 $yp = 85 $yt = 432
$f = 251 $k = 4746 $p = 870 $t = 2018
$o = 3284
$s = 2178
$d = 5193
$y = 8862
$a = 1058
$i = 1010
$c = 99
$e = 752
While this is only one possible decomposition of the text into groups
(or tokens), it's important to note that this is an entirely
*different* exercise from looking at (say) the raw frequency of <ch>. For
example, here there are 4234 <che>'s, leaving 6707 <ch>'s remaining in the
stream, whereas a raw count of <ch> would give the total of both
(4234 + 6707 = 10941).
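The difference between a raw count and a grouped count can be seen on a tiny example. This is a sketch on a made-up line (not real transcription data), with variable names chosen only for illustration:

```perl
use strict;
use warnings;

# Made-up sample line, not real transcription data.
my $line = "chedy chol shedy cheol";

# Raw count: every occurrence of "ch", whatever follows it.
my $raw = () = $line =~ /ch/g;

# Grouped count: take "che" out of the stream first, then count what is left.
my $work = $line;
my $che = ($work =~ s/che/_/g) || 0;
my $ch  = ($work =~ s/ch/_/g)  || 0;

print "raw ch = $raw, che = $che, remaining ch = $ch\n";
```

Here the raw count is 3, while the grouped pass gives 2 for `che` and 1 for the remaining `ch`.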
Cheers, .....Nick Pelling.....