
Re: VMs: wordlength persistence



I felt like extracting the same statistics from another text.
Writing the program was surprisingly easy. The first sufficiently
long text I found among my files was Caesar's "De Bello Gallico".

So here are the results:

     N    mean  s.d.
 1  245   6.64  2.191
 2 3601   6.02  2.765
 3 2588   5.94  2.828
 4 2766   5.95  2.847
 5 3349   6.08  2.925
 6 3421   6.10  2.895
 7 3454   6.18  3.005
 8 2623   6.24  2.995
 9 2128   6.01  3.026
10 1567   5.87  2.927
11 1040   6.02  2.932
12  597   6.01  2.861
13  290   5.70  2.779
14  116   5.94  3.069
15   52   5.46  3.016
16   13   7.23  3.332
17    4   6.75  2.278
18    0   0.00  0.000
19    0   0.00  0.000
20    2   6.50  2.500

The leftmost column is the word length; N is the number of words
of that length that are followed by another word; next is the mean
length of the following word, and finally its standard deviation.
Rows with N = 0 (lengths 18 and 19) simply show zeros.
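As a rough cross-check of how one such line is derived, here is a minimal
Python sketch (not the Euphoria program appended to this message). It takes
the counts for 1-letter words from the frequency table given further down
and recomputes N, the mean and the standard deviation. The name row_stats
is made up for this illustration, and the standard deviation is the
population form sqrt(E[x^2] - mean^2), which is what the appended program
uses.

```python
import math

# Row 1 of the frequency table in this message: counts[k-1] is the number of
# 1-letter words followed by a k-letter word.
counts = [7, 11, 4, 9, 19, 55, 63, 43, 17, 4, 10, 1, 2, 0, 0, 0, 0, 0, 0, 0]

def row_stats(counts):
    """Return (N, mean, s.d.) of the next-word length for one row of counts."""
    n = sum(counts)
    if n == 0:
        return 0, 0.0, 0.0  # empty row: report zeros, as in the table above
    sx = sum(k * f for k, f in enumerate(counts, start=1))
    sx2 = sum(k * k * f for k, f in enumerate(counts, start=1))
    mean = sx / n
    sd = math.sqrt(sx2 / n - mean * mean)  # population s.d., not sample s.d.
    return n, mean, sd

n, mean, sd = row_stats(counts)
print(f"{n:4d} {mean:6.2f} {sd:6.3f}")  # prints " 245   6.64  2.191"
```

This reproduces the first line of the table of results.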

Lovely. But the question remains: what to make of all
that? I don't think those figures are terribly informative.

The actual table of frequencies might be more interesting
(each line is 82 characters long; make sure the lines do not
wrap around):

     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 1   7  11   4   9  19  55  63  43  17   4  10   1   2   0   0   0   0   0   0   0
 2  33 366 390 393 432 485 434 382 260 204 111  64  30  11   4   2   0   0   0   0
 3  23 329 247 262 369 303 327 245 162 149  85  57  13   8   8   1   0   0   0   0
 4  35 364 254 275 349 343 357 234 228 147  86  53  28  12   0   1   0   0   0   0
 5  24 452 278 376 371 390 436 309 267 184 129  68  37  19   8   1   0   0   0   0
 6  31 458 299 316 377 455 396 343 287 212 135  58  36  10   7   1   0   0   0   0
 7  32 459 304 321 396 390 448 322 248 222 150  86  49  17   6   4   0   0   0   0
 8  20 330 231 214 317 332 332 249 202 158 108  74  30  15   7   1   1   0   0   2
 9  19 303 203 224 257 244 244 172 166 115  75  58  27  11   6   2   2   0   0   0
10   9 238 167 158 198 160 175 150 127  71  57  31  16   6   4   0   0   0   0   0
11   8 137 107 109 119 124 121  94  80  48  55  22  12   2   2   0   0   0   0   0
12   2  76  57  63  82  78  66  46  48  38  18  13   5   5   0   0   0   0   0   0
13   0  48  29  31  37  35  40  18  24  10   9   6   3   0   0   0   0   0   0   0
14   1  17  12   9  16  20   9  10   6   2   8   5   0   0   0   0   1   0   0   0
15   1  12   4   4   8   5   5   4   2   3   3   0   1   0   0   0   0   0   0   0
16   0   1   1   1   2   1   1   1   2   0   1   1   1   0   0   0   0   0   0   0
17   0   0   1   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0
18   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
19   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
20   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20

How to read it? Row 1 is words one letter long, row 2 words two letters long,
and so on. So row 1 shows that 1-letter words are followed 7 times by 1-letter
words, 11 times by 2-letter words, 4 times by 3-letter words, 9 times by 4-letter
words, and so on.
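To make the counting concrete, here is a small Python sketch of the same
idea on a tiny sample, assuming (as the appended program does) that a word
is a maximal run of the letters A-Z and a-z. The sample is just the opening
words of the work, and the function names are invented for this
illustration.

```python
import re
from collections import Counter

def word_lengths(text):
    """Lengths of maximal runs of ASCII letters, e.g. 'hello wall!' -> [5, 4]."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]

def transition_counts(lengths):
    """Count how often a word of length a is directly followed by one of length b."""
    return Counter(zip(lengths, lengths[1:]))

sample = "Gallia est omnis divisa in partes tres"
lengths = word_lengths(sample)
pairs = transition_counts(lengths)

print(lengths)         # [6, 3, 5, 6, 2, 6, 4]
print(pairs[(6, 3)])   # 1: a 6-letter word followed by a 3-letter word
```

Each key (a, b) corresponds to row a, column b of the table; the last word
contributes to no row because nothing follows it.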

What comes immediately to mind is: submit that to the chi-squared test. 
I did not do it because I thought: "So what?". It seemed meaningless to
me until I had formulated a meaningful hypothesis. I won't go into that now.
One thing, though, is interesting. It answers Nick Pelling's question:

(3) What about in reverse? (ie length of words preceding 2-letter words etc)

To get the reverse case you just swap rows and columns. The means and
standard deviations will differ of course, but chi-squared will not
differ one iota, because the test for independence treats rows and
columns symmetrically. So the significance of the distribution
is exactly the same, whichever direction you consider.
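A quick way to convince oneself of the symmetry is the following Python
sketch. It computes the usual Pearson statistic, the sum of
(observed - expected)^2 / expected with expected = row total * column total
/ grand total, for a small block of the table and for its transpose. Only
the 3x3 block for lengths 2 to 4 is used, purely as an illustration; a real
test on the full table would probably need the sparse rows and columns
pooled first.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for independence of rows and columns."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat

# Sub-block of the frequency table: rows and columns for lengths 2, 3 and 4.
block = [[366, 390, 393],
         [329, 247, 262],
         [364, 254, 275]]
transposed = [list(col) for col in zip(*block)]

forward = chi_squared(block)       # "what follows a word of length n"
reverse = chi_squared(transposed)  # "what precedes a word of length n"
print(math.isclose(forward, reverse))
```

The degrees of freedom, (rows - 1) * (columns - 1), are also unchanged by
the swap.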

For your amusement I append the source code (it's not great programming;
I spent at most 15 minutes on it):

include mylib.e

-- 1 for an ASCII letter (A-Z, a-z), 0 for anything else
type letter(object ch)
    if sequence(ch) then
	return 0
    end if
    return (ch>='A' and ch<='Z')
	or (ch>='a' and ch<='z')
end type


function Words2Lengths(sequence text)
-- each word is replaced by its length
-- e.g. "hello wall!" -> {5,4}
integer ch0, ch, n
sequence len

    len = {}
    ch0 = 0 -- not a letter
    n   = 0

    for i=1  to length(text) by 1 do
	ch = letter(text[i])
	if ch and ch0 then
	    n += 1
	elsif ch then -- beginning of new word
	    if n then
		len = append(len,n)
	    end if
	    n  = 1
	end if
	ch0 = ch
    end for
    if n then
	len = append(len,n)
    end if

    return len
end function

function max(sequence list)
integer mx
    mx = 0
    for i=1  to length(list) by 1 do
	if list[i]>mx then
	    mx = list[i]
	end if
    end for
    return mx
end function

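function fq(sequence list)
-- table[a][b] = number of times a word of length a
-- is immediately followed by a word of length b
integer maxlen
sequence table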
    maxlen = max(list)
    table = repeat(repeat(0,maxlen), maxlen)
    
    for i=2  to length(list) by 1 do
	table[list[i-1]][list[i]] += 1
    end for
    return table
end function

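function sumz(sequence list)
-- list[x] is the frequency of the value x
-- returns {n, sum of x, sum of x*x}, each weighted by frequency
atom sx, sx2, n, fq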
    sx = 0
    sx2 = 0
    n = 0
    for x= 1 to length(list)  by 1 do
	fq = list[x]
	if fq then
	    sx += x*fq
	    sx2+= x*x*fq
	    n += fq
	end if
    end for

    return {n,sx,sx2}
end function

function sdev(sequence sumz) -- n, sumx, sumx2
atom n, sx, sx2, mean, dev
    n = sumz[1]
    sx = sumz[2]
    sx2= sumz[3]
    if n = 0 then
	return {0,0,0}
    end if
    mean = sx/n
    -- population standard deviation: sqrt(mean of x*x - mean*mean)
    dev  = sqrt(sx2/n-mean*mean)
    return {n, mean, dev}
end function

function table2sumz(sequence table)
    for i=1  to length(table) by 1 do
	table[i] = sumz(table[i])
    end for
    return table
end function


sequence fn
sequence text, len, t, z
integer fh

text = ""
for i=1  to 5 by 1 do
    fn = sprintf("..\\latin\\gallic%d.txt",i)
    puts(1,fn&"\n")
    text &= Read(fn, "1")
end for


text = StripTags(text)

len = Words2Lengths(text)


t = fq(len)
fh = open("gallico.fq", "w")

for row=1  to length(t) by 1 do
    xprintf({1,fh}, "%2d", row)
    for col=1  to length(t)  by 1 do
	xprintf({1,fh},"%4d",t[row][col])
    end for
    xputs({1,fh},"\n")
end for
close(fh)

t = table2sumz(t)

puts(1,"\n")

fh = open("gallico.z","w")

for i=1   to length(t) by 1 do
    z = sdev(t[i])
    xprintf({1,fh},"%2d %4d %6.2f %6.3f\n",{i,z[1],z[2],z[3]})
end for

close(fh)   



