Re: VMs: wordlength persistence
I felt like extracting the same statistics from another text.
Writing the program was surprisingly easy. The first long enough
text I found in my mess was Caesar's "De Bello Gallico".
So here are the results:
len     N   mean    s.d.
  1   245   6.64   2.191
  2  3601   6.02   2.765
  3  2588   5.94   2.828
  4  2766   5.95   2.847
  5  3349   6.08   2.925
  6  3421   6.10   2.895
  7  3454   6.18   3.005
  8  2623   6.24   2.995
  9  2128   6.01   3.026
 10  1567   5.87   2.927
 11  1040   6.02   2.932
 12   597   6.01   2.861
 13   290   5.70   2.779
 14   116   5.94   3.069
 15    52   5.46   3.016
 16    13   7.23   3.332
 17     4   6.75   2.278
 18     0   0.00   0.000
 19     0   0.00   0.000
 20     2   6.50   2.500
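The leftmost column is the word length. N is the number of words of
that length that are followed by another word. Next is the mean length
of the following word, and finally its standard deviation (the
population form, dividing by N). No word of length 18 or 19 occurs,
so those rows are all zeros.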
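For anyone who wants to check a row by hand, here is a minimal sketch of how N, the mean and the s.d. follow from one row of counts. It is in Python rather than the language of my program, and the name `row_stats` is made up for illustration. It assumes the population formula sqrt(sum(x^2)/N - mean^2) and reports an empty row as zeros, like the table above.

```python
import math

def row_stats(counts):
    """Return (N, mean, s.d.) for one row of counts, where counts[k-1]
    is the number of times the following word had length k."""
    n = sum(counts)
    if n == 0:
        return 0, 0.0, 0.0  # empty row: reported as zeros
    sx = sum(k * c for k, c in enumerate(counts, start=1))
    sx2 = sum(k * k * c for k, c in enumerate(counts, start=1))
    mean = sx / n
    # population s.d.; max() guards against a tiny negative rounding error
    sd = math.sqrt(max(sx2 / n - mean * mean, 0.0))
    return n, mean, sd

# The row for 20-letter words: one follower of length 4, one of length 9.
row20 = [0] * 20
row20[4 - 1] = 1
row20[9 - 1] = 1
n, mean, sd = row_stats(row20)
print("%d %.2f %.3f" % (n, mean, sd))  # 2 6.50 2.500
```

The printed line agrees with the entry for length 20 in the table.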
Lovely. But the question remains: what to make of all
that? I don't think those figures are terribly informative.
The actual table of frequencies might be more interesting
(each line is 82 characters long, make sure they do not
wrap around):
       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 1   7  11   4   9  19  55  63  43  17   4  10   1   2   0   0   0   0   0   0   0
 2  33 366 390 393 432 485 434 382 260 204 111  64  30  11   4   2   0   0   0   0
 3  23 329 247 262 369 303 327 245 162 149  85  57  13   8   8   1   0   0   0   0
 4  35 364 254 275 349 343 357 234 228 147  86  53  28  12   0   1   0   0   0   0
 5  24 452 278 376 371 390 436 309 267 184 129  68  37  19   8   1   0   0   0   0
 6  31 458 299 316 377 455 396 343 287 212 135  58  36  10   7   1   0   0   0   0
 7  32 459 304 321 396 390 448 322 248 222 150  86  49  17   6   4   0   0   0   0
 8  20 330 231 214 317 332 332 249 202 158 108  74  30  15   7   1   1   0   0   2
 9  19 303 203 224 257 244 244 172 166 115  75  58  27  11   6   2   2   0   0   0
10   9 238 167 158 198 160 175 150 127  71  57  31  16   6   4   0   0   0   0   0
11   8 137 107 109 119 124 121  94  80  48  55  22  12   2   2   0   0   0   0   0
12   2  76  57  63  82  78  66  46  48  38  18  13   5   5   0   0   0   0   0   0
13   0  48  29  31  37  35  40  18  24  10   9   6   3   0   0   0   0   0   0   0
14   1  17  12   9  16  20   9  10   6   2   8   5   0   0   0   0   1   0   0   0
15   1  12   4   4   8   5   5   4   2   3   3   0   1   0   0   0   0   0   0   0
16   0   1   1   1   2   1   1   1   2   0   1   1   1   0   0   0   0   0   0   0
17   0   0   1   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0
18   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
19   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
20   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
How to read it? Row 1 is words one letter long, row 2 words two letters long,
and so on; the columns give the length of the following word. So row 1 shows
that 1-letter words are followed 7 times by 1-letter words, 11 times by
2-letter words, 4 times by 3-letter words, 9 times by 4-letter words, and so on.
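To make the layout concrete, here is a small sketch that builds the same kind of table from a single sentence. It is in Python, not the language of my program, and the names `word_lengths` and `follow_table` are made up for illustration. The sample is just the well-known opening words of the text, not the data behind the table above, and a word is assumed to be any run of ASCII letters.

```python
import re

def word_lengths(text):
    """Lengths of maximal runs of ASCII letters, e.g. 'hello wall!' -> [5, 4]."""
    return [len(w) for w in re.findall(r"[A-Za-z]+", text)]

def follow_table(lengths):
    """table[a-1][b-1] = number of times a word of length a is
    immediately followed by a word of length b."""
    size = max(lengths, default=0)
    table = [[0] * size for _ in range(size)]
    for a, b in zip(lengths, lengths[1:]):
        table[a - 1][b - 1] += 1
    return table

lengths = word_lengths("Gallia est omnis divisa in partes tres")
print(lengths)       # [6, 3, 5, 6, 2, 6, 4]
table = follow_table(lengths)
print(table[6 - 1])  # row 6: [0, 1, 1, 1, 0, 0]
```

Row 6 says that 6-letter words were followed once each by a 2-, 3- and 4-letter word. The last word of the text starts no pair, so it adds nothing to its own row.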
What comes immediately to mind is: submit that to the chi-squared test.
I did not do it because I thought: "So what?". The test seemed pointless to
me until I had a meaningful hypothesis to test. I won't go into that now.
One thing, though, is interesting. It answers Nick Pelling's question:
(3) What about in reverse? (ie length of words preceding 2-letter words etc)
To get the reverse case you just swap rows and columns. The means and
standard deviations will differ of course, but chi-squared will not
differ one iota: both the statistic and its degrees of freedom are
unchanged when the table is transposed. Which means that the significance
of the distribution is exactly the same, whichever direction you consider.
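The symmetry is easy to check numerically. Below is a minimal Python sketch (not the language of my program) of the usual Pearson statistic, sum of (observed - expected)^2 / expected with expected = row total * column total / grand total. The function names and the 2x3 counts are made up for illustration; this is not a result for the Caesar table, which I have not tested.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).
    Cells whose expected count is zero (empty row or column) are skipped."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    if total == 0:
        return 0.0
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_sums[r] * col_sums[c] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def transpose(table):
    """Swap rows and columns."""
    return [list(col) for col in zip(*table)]

# Made-up counts, for illustration only.
toy = [[10, 20, 30],
       [20, 20, 20]]
forward = chi_squared(toy)
reverse = chi_squared(transpose(toy))
print(forward, reverse)  # the two values agree up to rounding
```

Transposing only renames r and c in the double sum, so every term is the same, and (rows - 1) * (columns - 1) degrees of freedom is symmetric as well.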
For your amusement I append the source code (it's not great programming,
I spent 15 minutes on it at most):
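include mylib.e -- private library; presumably supplies Read, StripTags, xprintf, xputs

type letter(object ch)
    -- true for a single ASCII letter, false for anything else
    if sequence(ch) then
        return 0
    end if
    return (ch>='A' and ch<='Z')
        or (ch>='a' and ch<='z')
end type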

function Words2Lengths(sequence text)
    -- each word is replaced by its length
    -- e.g. "hello wall!" -> {5,4}
    integer ch0, ch, n
    sequence len
    len = {}
    ch0 = 0 -- previous character was not a letter
    n = 0   -- length of the current word
    for i=1 to length(text) by 1 do
        ch = letter(text[i])
        if ch and ch0 then -- still inside a word
            n += 1
        elsif ch then -- beginning of new word
            if n then
                len = append(len,n)
            end if
            n = 1
        end if
        ch0 = ch
    end for
    if n then -- flush the last word
        len = append(len,n)
    end if
    return len
end function

function max(sequence list)
    -- largest element of a list of positive integers (0 if empty)
    integer mx
    mx = 0
    for i=1 to length(list) by 1 do
        if list[i]>mx then
            mx = list[i]
        end if
    end for
    return mx
end function

function fq(sequence list)
    -- table[a][b] = number of times a word of length a
    -- is immediately followed by a word of length b
    integer maxlen
    sequence table
    maxlen = max(list)
    table = repeat(repeat(0,maxlen), maxlen)
    for i=2 to length(list) by 1 do
        table[list[i-1]][list[i]] += 1
    end for
    return table
end function

function sumz(sequence list)
    -- list[x] = number of occurrences of the value x
    -- returns {n, sum of x, sum of x*x}
    atom sx, sx2, n, fq
    sx = 0
    sx2 = 0
    n = 0
    for x= 1 to length(list) by 1 do
        fq = list[x]
        if fq then
            sx += x*fq
            sx2+= x*x*fq
            n += fq
        end if
    end for
    return {n,sx,sx2}
end function

function sdev(sequence sumz) -- n, sumx, sumx2
    -- returns {n, mean, population standard deviation}
    atom n, sx, sx2, mean, dev
    n = sumz[1]
    sx = sumz[2]
    sx2= sumz[3]
    if n = 0 then
        return {0,0,0}
    end if
    mean = sx/n
    dev = sqrt(sx2/n-mean*mean)
    return {n, mean, dev}
end function

function table2sumz(sequence table)
    -- replace each row of counts by its {n, sumx, sumx2}
    for i=1 to length(table) by 1 do
        table[i] = sumz(table[i])
    end for
    return table
end function

sequence fn
sequence text, len, t, z
integer fh

-- read the five text files and concatenate them
text = ""
for i=1 to 5 by 1 do
    fn = sprintf("..\\latin\\gallic%d.txt",i)
    puts(1,fn&"\n")
    text &= Read(fn, "1")
end for
text = StripTags(text)
len = Words2Lengths(text)

-- write the frequency table
-- (xprintf/xputs are from mylib.e; {1,fh} presumably means screen and file)
t = fq(len)
fh = open("gallico.fq", "w")
for row=1 to length(t) by 1 do
    xprintf({1,fh}, "%2d", row)
    for col=1 to length(t) by 1 do
        xprintf({1,fh},"%4d",t[row][col])
    end for
    xputs({1,fh},"\n")
end for
close(fh)

-- write N, mean and s.d. for each row
t = table2sumz(t)
puts(1,"\n")
fh = open("gallico.z","w")
for i=1 to length(t) by 1 do
    z = sdev(t[i])
    xprintf({1,fh},"%2d %4d %6.2f %6.3f\n",{i,z[1],z[2],z[3]})
end for
close(fh)