AW: determining the word-break character in VMS
> Jacques Guy wrote:
>
> Off the top of my head, without calculating any statistics, I
> would say that in Hungarian the letter e is more common than
> word-breaks (e.g. egyeségedre!). And again, in Arabic breaks
> between letters do not correspond to word breaks, thus anhar
> "rivers" is written a-space-nha-space-r because a (alif)
> cannot connect to the next letter. Likewise dar "house" is
> written d-space-a-space-r.
> No, we cannot be sure at all.
What I wanted to show with my calculations is that, with "." as the
word/token break, the min/average/max word/token length is consistently
within range. If "e" or "o" (or any other character) were the break
character, the max word/token length would be far too big ("o" as break
character would produce a max length of 59, whereas "." gives words of
at most 13 characters).
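
The comparison described above can be sketched roughly as follows. This is
only an illustration, not the actual calculation: the function name
token_length_stats and the sample string are hypothetical, the sample is a
toy string rather than a real transcription, and the text is assumed to be
one string in which line breaks have already been removed.

```python
from statistics import mean


def token_length_stats(text, break_char):
    """Return (min, mean, max) token length when splitting on break_char.

    Empty tokens (from doubled or leading/trailing break characters)
    are ignored. Hypothetical helper, for illustration only.
    """
    tokens = [t for t in text.split(break_char) if t]
    if not tokens:
        raise ValueError("text contains no tokens for this break character")
    lengths = [len(t) for t in tokens]
    return min(lengths), mean(lengths), max(lengths)


# Toy input, NOT a real transcription: "." is the intended break here.
sample = "ab.cde.f.ghij"

# Compare the intended break character against an ordinary letter.
for candidate in (".", "e"):
    shortest, average, longest = token_length_stats(sample, candidate)
    print(candidate, shortest, average, longest)
```

A candidate that is really an ordinary letter tends to give a much larger
maximum token length than the true break character, which is the effect
the figures 59 and 13 illustrate.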
Claus