
Re: VMs: Pointing at spaces



Sorry for these possibly tangential remarks.  It seems to me that this
sort of consideration might help in thinking about what the VMs
represents, especially since it is considered to date to a period in
Europe before some of the conventions governing these things were
established for European languages.

On Sat, 26 Feb 2005, Jacques Guy wrote:
> First, the segmentation of continuous text into its constituent
> morphemes is a problem still unsolved.

I'd be inclined to distinguish between morphemes, logical (or syntactic)
words, and phonological words.  I mean this in general; I'm not sure it is
relevant to the VMs situation.  Words are made up of one or more
morphemes, e.g., cat (1), cats (cat-s) (2), etc.  Logical words are the
basic elements
manipulated by the syntax, e.g., 'does' and 'not', while phonological
words are the result of clitic processes that sometimes tack one logical
word onto another in a single phonological word, e.g., 'not' onto
preceding 'does' to make 'doesn't'.

A slight complication here is that orthographic words are yet another
thing. Orthographic systems (often just one per language, but not always)
don't always recognize logical and phonological words in any consistent
way.  E.g., in English one can write either 'does not' or 'doesn't' for
(pronounced) 'doesn't', and many patterns of enclisis (and proclisis) are
never recognized at all, e.g., the definite article 'the' is only
occasionally written as part of a following word, e.g., sometimes
'th'enchanted' or something like that in song lyrics.  I don't think
'a/an' is ever written that way.  Systematic use of apostrophe in writing
things like this is more or less a feature of modern Western European
scripts.

I think a lot about clisis because I deal with (Siouan) languages where
there are no traditional writing systems and so no immemorial rules for
how to handle such things orthographically.  Since there is heavy use of
enclitics and proclitics in Siouan languages the issue arises immediately:
write the (en)clitics always separately and end up with lots of mini-words
or write them all with their "base" word and end up with real monsters.
The usual approach is a sort of ad hoc scheme of when to insert spaces.
For example, one often writes a space before the future marker and plural
marker or information status markers, but not before the negative or
auxiliary verbs.

Another possibility is to write a special mark like "=" to indicate
enclisis, though this is seldom done except by linguists in technical
contexts.
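
To make the options concrete, here is a rough sketch in Python of the
spacing choices just described.  Everything in it is an assumption made
for illustration: the labels BASE, FUT, PL, STATUS, NEG and AUX are
placeholder glosses, not forms from any actual Siouan language, and the
"adhoc" rule only mimics the example above (space before the future,
plural and information status markers, none before the negative or
auxiliaries).

```python
# Placeholder clitic classes for the ad hoc scheme (hypothetical labels).
SPACED = {"FUT", "PL", "STATUS"}   # written after a space
ATTACHED = {"NEG", "AUX"}          # written solid with the preceding material


def render(base, clitics, scheme):
    """Write a base word plus its enclitics under one spacing scheme."""
    out = base
    for clitic in clitics:
        if scheme == "separate":      # every clitic is its own mini-word
            out += " " + clitic
        elif scheme == "joined":      # everything in one long "monster"
            out += clitic
        elif scheme == "marked":      # linguists' "=" boundary mark
            out += "=" + clitic
        elif scheme == "adhoc":       # space only before certain clitics
            out += (" " if clitic in SPACED else "") + clitic
        else:
            raise ValueError("unknown scheme: " + scheme)
    return out


if __name__ == "__main__":
    for scheme in ("separate", "joined", "marked", "adhoc"):
        print(scheme, "->", render("BASE", ["NEG", "FUT"], scheme))
```

The same underlying sequence comes out as "BASE NEG FUT", "BASENEGFUT",
"BASE=NEG=FUT" or "BASENEG FUT" depending on the scheme, so the spaces in
the written form say as much about the convention as about the language.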

Problems occur because some enclitics cause changes in the preceding word,
and some "lexical words" - things you might enter in a dictionary -
contain clitic elements.  And there seem to be several layers of
enclitics.

> Next, "Insertion rules" vary from writing system to
> writing system.

It's more like suppression rules, in the terms above, though insertion
rules are also an issue.  Many ancient scripts didn't use spacing.  Or
some special symbol might be used in lieu of spacing.

This said, I'm out of my depth with Arabic!

> In Arabic for instance each letter "decides" if a space is needed. The
> letter "d" for instance can never connect to the next letter and so is
> always followed by a break.

Is the space between d and what follows in a single word the same as the
space between a word ending in d and a following word?

> Same with ' (the glottal stop). Some morphemes are prefixed, some
> suffixed, and that obscures the morpheme boundaries, e.g. 'al malak is
> written ' lmlk

Sounds like the answer to my previous question is 'yes', though I'd
call al and malak (logical) words in this context.  It's true that they
are probably monomorphemic words.  Maybe not malak, depending on whether
we see it as mlk plus "singular" or malak with the default vocalism.
