Re: VMs: Character repetition
OK, here is my revised version, using relative values:
4.62723 6 .-.
2.31618 1 e-e
2.0771 1 i-i
1.67688 7 o-o
1.46827 7 y-y
1.06746 6 h-h
0.760113 7 d-d
0.720149 7 a-a
0.67459 7 k-k
0.60622 6 c-c
0.304525 7 t-t
0.301424 6 l-l
0.290138 7 q-q
0.216604 7 n-n
0.212577 3 r-r
0.188137 6 s-s
0.0423617 7 p-p
0.0101448 5 m-m
0.00559494 7 f-f
1st column: percentage of occurrences, relative to all character pairs within a line that have the same distance
2nd column: distance (difference in position within the line) between an earlier character and the current one; a distance of 1 means the two characters are adjacent
3rd column: earlier character and current character.
The method is:
Take every line. Look at the characters one by one, going from left to right.
For each character, compute the distance (within the line) to the earlier characters and increment the counter for that pair and distance (there is a table indexed by pair and distance). The script remembers only the most recent position of each character, so each distinct earlier character contributes one distance.
After going through the whole VMS, divide the count of each pair/distance entry by the total number of occurrences of that distance.
This is the awk script:
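BEGIN { code = "-=.abcdefghijklmnopqrstuvwxyz" }  # characters that are counted
{
    pos = 0
    for (i in lp)              # forget the positions from the previous line
        delete lp[i]
    gsub(/\{[^}]*\}/, "")      # strip each inline {comment} separately
    for (i = 1; i <= length($2); i++) {
        c = substr($2, i, 1)
        if (index(code, c) < 1)
            continue           # skip anything outside the alphabet
        pos++
        for (j in lp) {        # lp[j]: most recent position of character j
            fp = pos - lp[j]   # distance between j and the current character
            pair = fp " " j "-" c
            count[pair]++      # occurrences of this pair at this distance
            countfp[fp]++      # occurrences of this distance, over all pairs
        }
        lp[c] = pos
    }
}
END {
    for (i in count) {
        split(i, aa, " ")      # aa[1] is the distance part of the key
        # percentage of this pair among all pairs at the same distance
        print count[i] * 100 / countfp[aa[1]] " " i
    }
}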
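For readers who do not use awk, the same counting can be sketched in Python. This is only an illustration of the method described above, not the script that produced the table. It assumes the lines have already been reduced to the bare transcription text (no locus identifier, no {comments}), and the names pair_distance_percentages, ALPHABET and last_pos are made up for this example.

```python
from collections import Counter

# Assumption: same alphabet as the `code` string in the awk script.
ALPHABET = set("-=.abcdefghijklmnopqrstuvwxyz")

def pair_distance_percentages(lines):
    """Return {(distance, earlier, current): percent} for the given lines."""
    count = Counter()         # occurrences per (distance, earlier, current)
    per_distance = Counter()  # occurrences per distance, over all pairs
    for text in lines:
        last_pos = {}         # most recent position of each character
        pos = 0
        for c in text:
            if c not in ALPHABET:
                continue      # skip characters outside the alphabet
            pos += 1
            for earlier, p in last_pos.items():
                d = pos - p
                count[(d, earlier, c)] += 1
                per_distance[d] += 1
            last_pos[c] = pos
    # Normalise each entry by the total for its distance.
    return {k: 100.0 * n / per_distance[k[0]] for k, n in count.items()}

# Tiny made-up input, just to show the shape of the result.
table = pair_distance_percentages(["ee.e"])
for (d, a, b), pct in sorted(table.items()):
    print(f"{pct:.4f} {d} {a}-{b}")
```

In the toy line "ee.e" there are three pairs at distance 1 (e-e, e-. and .-e), so each gets one third of the distance-1 total, and the single distance-2 pair e-e gets 100 percent.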
After that, I get a list of all character-pair distances with their relative frequencies. With it I can now answer questions such as: what is the probability that character 'x1' is followed by 'x2', or which characters are most likely to be found at positions 1, 2, 3, 4, 5, ... after a space. The double-character table above is just a by-product.
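As an illustration of those two questions, here is a minimal Python sketch. It is not the script used for the numbers above, and it works directly on the text rather than on the pair/distance table. It assumes that "." is the word space in the transcription and that the start of a line counts as a word boundary; the function names are made up for this example.

```python
from collections import Counter, defaultdict

def follower_probabilities(lines):
    """Estimate P(next char = x2 | current char = x1) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for text in lines:
        for x1, x2 in zip(text, text[1:]):
            pair_counts[x1][x2] += 1
    return {
        x1: {x2: n / sum(followers.values()) for x2, n in followers.items()}
        for x1, followers in pair_counts.items()
    }

def position_after_space(lines, space="."):
    """Count which characters occur at position 1, 2, 3, ... of each word."""
    by_position = defaultdict(Counter)
    for text in lines:
        for word in text.split(space):
            for k, c in enumerate(word, start=1):
                by_position[k][c] += 1
    return by_position

# Tiny made-up input, just to show how the results are read.
probs = follower_probabilities(["qo.qe"])
print(probs["q"])                      # followers of 'q' with probabilities
positions = position_after_space(["qo.qe"])
print(positions[1].most_common(3))     # most frequent word-initial characters
```

Dividing by the total for each 'x1' gives a conditional probability, which is a different normalisation from the table above, where each entry is divided by the total for its distance.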
Claus
-----Original Message-----
From: Lukas Palatinus [mailto:palat@xxxxxx]
Sent: Wednesday, 15 September 2004 15:42
To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: Character repetition
Hi Claus,
> I found the following table quite interesting:
> Occurrences distance char-pair
> 6862 6 .-.
> 5028 1 e-e
> 4509 1 i-i
> 2145 3 o-o
> 1837 7 y-y
> 1583 6 h-h
...
Could you please give a more detailed description of what your
table means and how it was generated?
Does the table list the most frequent distances between the specified
pairs of letters? What about the distribution of the distances?
The table looks very interesting, but I think more explanation is
needed for it to be of real use to others.
Lukas
______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list