Re: VMs: Reducing the VMS to a stream of grouped glyphs...?

To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: Reducing the VMS to a stream of grouped glyphs...?
From: Nick Pelling <incoming@xxxxxxxxxxxxxxxxx>
Date: Mon, 03 Mar 2003 23:39:38 +0000
In-reply-to: <5.1.0.14.0.20030301130616.03c0b320@pop3.blueyonder.co.uk>
References: <3E5E753C.A96DFC05@amu.edu.pl> <NIEMKNCNNHJOGEJLILNMMEOBCFAA.John@morewood.net> <000701c2dae2$3a87af30$5785590c@YOUREA216FD09B> <3E5915B8.860723E@amu.edu.pl> <003201c2dbb3$76ba1250$f8ae5e0c@YOUREA216FD09B> <3E5D2EE5.598C50B1@amu.edu.pl> <OE15UcXzxxQLj9KxJJS00005c5b@hotmail.com>
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx

Hi everyone,

Here's a list of Perl regular expressions that I ran (in order) over one transcription of the VMS text to test out some of the ideas I've been talking about recently (results below). Each substitution replaces the group it matches with "_", so every later pattern only sees what the earlier ones left behind. Please feel free to experiment with (and develop) these as you like; it's not exactly rocket science. :-)

        # Run once per line of the transcription (the line is assumed to be in $_).
        # s///g returns the number of replacements, so each counter accumulates
        # how many times its group was removed. Order matters: longer groups
        # (e.g. "oke") are tried before their shorter parts ("ok", "ke").
        $dy     += s/dy/_/g;
        $ol     += s/ol/_/g;
        $or     += s/or/_/g;
        $al     += s/al/_/g;
        $ar     += s/ar/_/g;
        $am     += s/am/_/g;
        $om     += s/om/_/g;
        $ee     += s/ee/_/g;
        $cc     += s/cc/_/g;
        $an     += s/an/_/g;
        $ain    += s/ain/_/g;
        $aiin   += s/aiin/_/g;
        $aiiin  += s/aiiin/_/g;
        $air    += s/air/_/g;
        $aiir   += s/aiir/_/g;
        $aiiir  += s/aiiir/_/g;
        $on     += s/on/_/g;
        $oin    += s/oin/_/g;
        $oiin   += s/oiin/_/g;
        $oiiin  += s/oiiin/_/g;
        $oir    += s/oir/_/g;
        $oiir   += s/oiir/_/g;
        $oiiir  += s/oiiir/_/g;
        $aiim   += s/aiim/_/g;
        $qo     += s/qo/_/g;
        $ofe    += s/ofe/_/g;
        $of     += s/of/_/g;
        $fe     += s/fe/_/g;
        $oke    += s/oke/_/g;
        $ok     += s/ok/_/g;
        $ke     += s/ke/_/g;
        $ope    += s/ope/_/g;
        $op     += s/op/_/g;
        $pe     += s/pe/_/g;
        $ote    += s/ote/_/g;
        $ot     += s/ot/_/g;
        $te     += s/te/_/g;
        $ocfhe  += s/ocfhe/_/g;
        $ocfh   += s/ocfh/_/g;
        $cfhe   += s/cfhe/_/g;
        $cfh    += s/cfh/_/g;
        $ockhe  += s/ockhe/_/g;
        $ockh   += s/ockh/_/g;
        $ckhe   += s/ckhe/_/g;
        $ckh    += s/ckh/_/g;
        $ocphe  += s/ocphe/_/g;
        $ocph   += s/ocph/_/g;
        $cphe   += s/cphe/_/g;
        $cph    += s/cph/_/g;
        $octhe  += s/octhe/_/g;
        $octh   += s/octh/_/g;
        $cthe   += s/cthe/_/g;
        $cth    += s/cth/_/g;
        $she    += s/she/_/g;
        $sh     += s/sh/_/g;
        $che    += s/che/_/g;
        $ch     += s/ch/_/g;
        $od     += s/od/_/g;
        $os     += s/os/_/g;
        $oe     += s/oe/_/g;
        $fy     += s/fy/_/g;
        $ky     += s/ky/_/g;
        $py     += s/py/_/g;
        $ty     += s/ty/_/g;
        $yf     += s/yf/_/g;
        $yk     += s/yk/_/g;
        $yp     += s/yp/_/g;
        $yt     += s/yt/_/g;

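        # Count the single glyphs still left in the line. Each glyph is
        # replaced by itself, so these passes count without changing the text.
        $o      += s/o/o/g;
        $s      += s/s/s/g;
        $d      += s/d/d/g;
        $y      += s/y/y/g;
        $f      += s/f/f/g;
        $k      += s/k/k/g;
        $p      += s/p/p/g;
        $t      += s/t/t/g;
        $a      += s/a/a/g;
        $i      += s/i/i/g;
        $c      += s/c/c/g;
        $e      += s/e/e/g;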
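
The same kind of ordered pass can also be written as a loop over a pattern list, which makes the order explicit and easy to rearrange. What follows is only a minimal sketch, not the script used for the counts in this message: the names count_groups, @patterns and %count are hypothetical, the pattern list is a small subset of the one above, and the sample text is made up rather than taken from any transcription.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the patterns in order to one line of text,
# replacing each match with "_" and adding the match count to %$count.
sub count_groups {
    my ($line, $patterns, $count) = @_;
    for my $pat (@$patterns) {
        # s///g returns the number of substitutions made (false if none)
        my $n = ($line =~ s/\Q$pat\E/_/g);
        $count->{$pat} += $n || 0;
    }
    return $line;    # whatever is left once all groups have been removed
}

# An illustrative subset of the list above, in the same relative order.
my @patterns = qw(dy ol ee qo oke ok ke che ch);

# Made-up sample text, not taken from any transcription.
my $sample = "qokeedy.chedy.okol.chol";

my %count;
my $rest = count_groups($sample, \@patterns, \%count);

print "$_ = $count{$_}\n" for @patterns;
print "remainder: $rest\n";
```

Reordering @patterns (for instance, moving "ee" after "oke") changes the counts, which is a quick way to see how much a given decomposition depends on the order chosen.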

Here are the results this returned (the remainder was not counted), slightly reformatted:

$dy = 6894
$ol = 5571      $od = 1006      $os = 436       $oe = 121
$al = 3153
$am = 811
$om = 166
$ee = 4621
$cc = 0
$an = 116       $ain = 1333     $aiin = 3924    $aiiin = 73
$ar = 3364      $air = 585      $aiir = 111     $aiiir = 1
$on = 7         $oin = 10       $oiin = 146     $oiiin = 30
$or = 2655      $oir = 7        $oiir = 14      $oiiir = 0
$aiim = 13
$qo = 4997
$ofe = 0        $of = 114       $fe = 0
$oke = 577      $ok = 2407      $ke = 1201
$ope = 2        $op = 429       $pe = 3
$ote = 547      $ot = 2215      $te = 491
$ocfhe = 4      $ocfh = 13      $cfhe = 14      $cfh = 49
$ockhe = 39     $ockh = 101     $ckhe = 195     $ckh = 590
$ocphe = 8      $ocph = 21      $cphe = 55      $cph = 137
$octhe = 26     $octh = 99      $cthe = 162     $cth = 676
$she = 2099     $sh = 2408
$che = 4234     $ch = 6707
$fy = 20        $ky = 540       $py = 44        $ty = 264
$yf = 19        $yk = 526       $yp = 85        $yt = 432

$f = 251        $k = 4746       $p = 870        $t = 2018

$o = 3284
$s = 2178
$d = 5193
$y = 8862
$a = 1058
$i = 1010
$c = 99
$e = 752

While this is only one possible decomposition of the text into groups (or tokens), it's important to note that this is an entirely *different* exercise from looking at (say) the raw frequency of <ch>. For example, here there are 4234 <che>'s, which are removed first, leaving 6707 <ch>'s in the stream - whereas a raw count of <ch> would give the total of both (4234 + 6707 = 10941).
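
To make the difference concrete, here is a small sketch comparing a raw count of "ch" with the grouped counts obtained by removing "che" first. The sample text is made up for illustration (it is not from any transcription), and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Made-up sample text, not taken from any transcription.
my $text = "chedy.chol.shey.chey";

# Raw frequency: every occurrence of "ch", whatever follows it.
my $raw_ch = () = $text =~ /ch/g;

# Grouped count: remove "che" first, then count the "ch" that is left.
my $work = $text;
my $che  = ($work =~ s/che/_/g) || 0;
my $ch   = ($work =~ s/ch/_/g)  || 0;

print "raw ch = $raw_ch, che = $che, remaining ch = $ch\n";
```

On this sample the raw count of "ch" equals the sum of the two grouped counts, just as in the <che>/<ch> figures above; the grouped counts simply split that total according to what follows.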

Cheers, .....Nick Pelling.....
