AW: determining the word-break character in VMS





> Jacques Guy wrote:
>
> Off the top of my head, without calculating any statistics, I
> would say that in Hungarian the letter e is more common than
> word-breaks (e.g. egyeségedre!). And again, in Arabic breaks
> between letters do not correspond to word breaks, thus anhar
> "rivers" is written a-space-nha-space-r because a (alif)
> cannot connect to the next letter. Likewise dar "house" is
> written d-space-a-space-r.
>
> No, we cannot be sure at all.

What I wanted to show with my calculations is that, with "." as the
word/token break, the min/average/max word/token length is consistently
within range. If "e" or "o" (or any other character) were the break
character, the maximum word/token length would be far too big.
(With "o" as the break character the maximum length would be 59, whereas
with "." the longest word has 13 characters.)
	Claus
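
The comparison described above can be sketched roughly as follows. This is only a minimal illustration, not the calculation actually used for the figures quoted: the function name token_length_stats and the sample string are assumptions made up for the example, and the sample is not real transcription data. It splits a text on a candidate break character and reports the min/average/max token length, so that different candidates can be compared side by side.

```python
from statistics import mean


def token_length_stats(text, break_char):
    """Split text on break_char and return (min, mean, max) token length.

    Empty tokens (from leading, trailing, or doubled break characters)
    are ignored. Returns None if no tokens remain.
    """
    tokens = [t for t in text.split(break_char) if t]
    if not tokens:
        return None
    lengths = [len(t) for t in tokens]
    return min(lengths), mean(lengths), max(lengths)


# Hypothetical sample, invented for illustration only.
sample = "okal.dor.shedo.qotol.otedy.chor.dal"

# Compare the assumed break character "." with two ordinary letters.
for candidate in (".", "o", "e"):
    print(repr(candidate), token_length_stats(sample, candidate))
```

On a full transcription, a plausible break character should keep the maximum token length small, while an ordinary letter treated as a break should produce occasional very long tokens.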