
Statistics on copy(-n) -- was dain daiin



I took the first sufficiently long English text I found
on my hard disk (a 19-chapter article on Scientology,
about 225k long). The results:

fq copy( -1) =     23
fq copy( -2) =    206
fq copy( -3) =    400
fq copy( -4) =    479
fq copy( -5) =    436
fq copy( -6) =    433
fq copy( -7) =    492
fq copy( -8) =    464
fq copy( -9) =    473
fq copy(-10) =    488
fq copy(-11) =    481
fq copy(-12) =    450
fq copy(-13) =    425
fq copy(-14) =    415
fq copy(-15) =    425
fq copy(-16) =    403
fq copy(-17) =    463
fq copy(-18) =    454
fq copy(-19) =    444
fq copy(-20) =    416
fq copy(-21) =    441
fq copy(-22) =    417
fq copy(-23) =    487
fq copy(-24) =    416
fq copy(-25) =    418
fq copy(-26) =    449
fq copy(-27) =    437
fq copy(-28) =    421
fq copy(-29) =    446
fq copy(-30) =    412
fq copy(-31) =    396
fq copy(-32) =    442
fq copy(-33) =    425
fq copy(-34) =    435
fq copy(-35) =    425
fq copy(-36) =    416
fq copy(-37) =    369
fq copy(-38) =    380
fq copy(-39) =    416
fq copy(-40) =    426

I am not surprised at all, knowing how
English works.  In a language like Malay,
the frequency of copy(-1) would be much
higher, I guess.

In a language like Arabic or Hebrew,
the frequency of copy(-2) would be much higher
too, if the article is considered a
separate word.

Incidentally, I remember that duplicated words
are quite frequent in the VMS, as in Malay and,
to a lesser extent perhaps, Chinese.

Here is the program that did it, so that you
can check for bugs if the spirit so moves you. 

It's written in Euphoria, which you
can download from here:

http://www.rapideuphoria.com/v20.htm

-------------------------------------------

function read(sequence fn) -- fn is the file name
integer f
sequence text
object buffer
    text = ""
    f = open(fn,"r")
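    if f != -1 then  -- open() returns -1 on failure, and "if f" would accept that
	while 1 do
	    buffer = gets(f)
	    if atom(buffer) then  -- gets() returns the atom -1 at end of file
		exit
	    end if
	    text &= buffer
	end while
	close(f)
    end if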
    return text
end function


function code(integer ch)
    if ch>='A' and ch<='Z' then
	ch += 32
    end if
    if ch<'a' or ch>'z' then
	ch = ' '
    end if
    return ch
end function

function split(sequence text)
-- split a continuous text into a sequence of words
sequence wordList, word
integer ch
    wordList = {}
    word = ""
    for i=1 to length(text) do
	ch = code(text[i])  -- lower-case letter, or ' ' for anything else
	if ch = ' ' then
	    -- a separator closes the current word, if there is one
	    if length(word) then
		wordList = append(wordList, word)
		word = ""
	    end if
	else
	    word &= ch
	end if
    end for
    if length(word) then
	wordList = append(wordList, word)
    end if
    return wordList
end function

function count(sequence text, integer  howfar)
sequence fq, word1, word2
integer stop
    fq = repeat(0,howfar)
    for i=1 to length(text)  by 1 do
	word1 = text[i]
	stop = i+howfar
	if stop>length(text) then
	    stop = length(text)
	end if
	for j=i+1  to stop by 1 do
	    word2 = text[j]
	    if equal(word1,word2) then
		fq[j-i] += 1
	    end if
	end for
    end for
    return fq
end function

sequence s, fq
integer f

-- just checking:
--s = split("Hello hello world, how\n\n are you")

s = split(read("n:\\dos_hda5\\sciento\\all.txt"))

---for i=1  to length(s) by 1 do
    ---puts(1,s[i]&'|')
---end for

fq = count(s,40)

f = open("fq","w")

for i=1  to length(fq) by 1 do
    printf(f,"fq copy(%3d) = %6d\n",{-i,fq[i]})
    printf(1,"fq copy(%3d) = %6d\n",{-i,fq[i]}) -- echo to screen
end for

close(f)
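
-------------------------------------------

A cheap way to check for bugs is to run the same pair counting
on a tiny word list where the answer can be worked out by hand.
What follows is only a minimal sketch: count_pairs is a stand-in
name that repeats the logic of count() so that the snippet runs
on its own, and the five-word list is made up for illustration.

-------------------------------------------

-- fq[n] = number of positions i for which words[i] = words[i+n]
function count_pairs(sequence words, integer howfar)
sequence fq
integer last
    fq = repeat(0, howfar)
    for i = 1 to length(words) do
	last = i + howfar
	if last > length(words) then
	    last = length(words)
	end if
	for j = i+1 to last do
	    if equal(words[i], words[j]) then
		fq[j-i] += 1
	    end if
	end for
    end for
    return fq
end function

sequence toy, fq

-- made-up input: "a" is at positions 1, 2 and 4, "b" at 3 and 5
toy = {"a", "a", "b", "a", "b"}
fq = count_pairs(toy, 3)

for n = 1 to length(fq) do
    printf(1, "fq copy(%3d) = %6d\n", {-n, fq[n]})
end for

-------------------------------------------

By hand: positions 1 and 2 match at distance 1, positions 2 and 4
and positions 3 and 5 match at distance 2, and positions 1 and 4
match at distance 3, so fq should come out as {1,2,1}.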