Statistics on copy(-n) -- was dain daiin
I took the first sufficiently long English text I found
on my hard disk (a 19-chapter article on Scientology,
about 225 KB long). The results:
fq copy( -1) = 23
fq copy( -2) = 206
fq copy( -3) = 400
fq copy( -4) = 479
fq copy( -5) = 436
fq copy( -6) = 433
fq copy( -7) = 492
fq copy( -8) = 464
fq copy( -9) = 473
fq copy(-10) = 488
fq copy(-11) = 481
fq copy(-12) = 450
fq copy(-13) = 425
fq copy(-14) = 415
fq copy(-15) = 425
fq copy(-16) = 403
fq copy(-17) = 463
fq copy(-18) = 454
fq copy(-19) = 444
fq copy(-20) = 416
fq copy(-21) = 441
fq copy(-22) = 417
fq copy(-23) = 487
fq copy(-24) = 416
fq copy(-25) = 418
fq copy(-26) = 449
fq copy(-27) = 437
fq copy(-28) = 421
fq copy(-29) = 446
fq copy(-30) = 412
fq copy(-31) = 396
fq copy(-32) = 442
fq copy(-33) = 425
fq copy(-34) = 435
fq copy(-35) = 425
fq copy(-36) = 416
fq copy(-37) = 369
fq copy(-38) = 380
fq copy(-39) = 416
fq copy(-40) = 426
I am not surprised at all, knowing how
English works. In a language like Malay
the frequency of copy(-1) would be much
higher, I guess.
In a language like Arabic, or Hebrew,
the frequency of copy(-2) would be much higher
too, if the article is considered a
separate word.
Incidentally, I remember that duplicated words
are quite frequent in the VMS, like in Malay and,
to a lesser extent perhaps, Chinese.
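To make the two guesses above concrete, here is a small Python sketch (not part of the Euphoria program below) that counts copy(-n) on two toy phrases. The helper name `copy_counts` and the sample phrases are assumptions chosen only for illustration: Malay plural reduplication ("buku buku", books) and the Arabic article written as a separate word ("al kitab al kabir", the big book).

```python
def copy_counts(words, howfar):
    """Return fq where fq[n-1] counts positions i with words[i] == words[i-n]."""
    fq = [0] * howfar
    for i, w in enumerate(words):
        for n in range(1, howfar + 1):
            # Compare the current word with the word n positions earlier.
            if i - n >= 0 and words[i - n] == w:
                fq[n - 1] += 1
    return fq


# Toy inputs (illustrative assumptions, not data from the post).
malay = "buku buku itu".split()        # reduplicated noun
arabic = "al kitab al kabir".split()   # article treated as its own word

malay_fq = copy_counts(malay, 3)
arabic_fq = copy_counts(arabic, 3)
print(malay_fq)   # the repeat shows up at distance 1
print(arabic_fq)  # the repeated article shows up at distance 2
```

On real text the effect would of course be diluted by everything else in the sentence; the sketch only shows where such repeats land in the table.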
Here is the program that did it, so that you
can check for bugs if the spirit so moves you.
It's written in Euphoria, which you
can download from here:
http://www.rapideuphoria.com/v20.htm
-------------------------------------------
function read(sequence fn) -- fn is the file name
integer f
sequence text
object buffer
text = ""
f = open(fn,"r")
if f != -1 then -- open() returns -1 on failure, and -1 counts as true
while 1 do
buffer = gets(f)
if atom(buffer) then
exit
else text &= buffer
end if
end while
close(f)
end if
return text
end function
function code(integer ch)
if ch>='A' and ch<='Z' then
ch += 32
end if
if ch<'a' or ch>'z' then
ch = ' '
end if
return ch
end function
function split(sequence text)
-- split a continuous text into a sequence of words
sequence wordList, word
integer ch
wordList = {}
word = ""
for i=1 to length(text) by 1 do
ch = code(text[i]) -- lower-case letter, or ' ' for anything else
if ch = ' ' then
if length(word) then
wordList = append(wordList, word)
word = ""
end if
else word &= ch
end if
end for
if length(word) then
wordList = append(wordList, word)
end if
return wordList
end function
function count(sequence text, integer howfar)
sequence fq, word1, word2
integer stop
fq = repeat(0,howfar)
for i=1 to length(text) by 1 do
word1 = text[i]
stop = i+howfar
if stop>length(text) then
stop = length(text)
end if
for j=i+1 to stop by 1 do
word2 = text[j]
if equal(word1,word2) then
fq[j-i] += 1
end if
end for
end for
return fq
end function
sequence s, fq
integer f
-- just checking:
--s = split("Hello hello world, how\n\n are you")
s = split(read("n:\\dos_hda5\\sciento\\all.txt"))
---for i=1 to length(s) by 1 do
---puts(1,s[i]&'|')
---end for
fq = count(s,40)
f = open("fq","w")
for i=1 to length(fq) by 1 do
printf(f,"fq copy(%3d) = %6d\n",{-i,fq[i]})
printf(1,"fq copy(%3d) = %6d\n",{-i,fq[i]}) -- echo to screen
end for
close(f)
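
For anyone who wants to cross-check the numbers without installing Euphoria, the following is a rough Python port of the same steps: letters are lower-cased, everything else is a word separator, and each pair of equal words at distance 1..howfar is counted once. It is a sketch under those assumptions, not the program that produced the table above; the names `split_words` and `count` are chosen here, and the input is the short test string from the commented-out check rather than the original file.

```python
import re


def split_words(text):
    """Split text into lower-cased words; any character outside A-Z/a-z separates words."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]


def count(words, howfar):
    """fq[d-1] = number of pairs (i, j) with j - i == d and words[i] == words[j]."""
    fq = [0] * howfar
    for i in range(len(words)):
        # Look ahead at most howfar words, without running past the end.
        stop = min(i + howfar, len(words) - 1)
        for j in range(i + 1, stop + 1):
            if words[i] == words[j]:
                fq[j - i - 1] += 1
    return fq


sample = "Hello hello world, how\n\n are you"
words = split_words(sample)
fq = count(words, 3)
for d, c in enumerate(fq, start=1):
    # Same output layout as the printf calls in the Euphoria program.
    print("fq copy(%3d) = %6d" % (-d, c))
```

To run it on a real text, read the file into a string and pass it to `split_words` in place of `sample`, then call `count(words, 40)`.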