
VMs: Re: VMS Word context similarities



On Wed, 7 Sep 2005, Marke Fincher wrote:
> I wrote a crude program which given a large input text tries to identify
> groups of words which occur in a similar context.

This is interesting, and I think it should be independent of the analysis
of the text into characters, since, I think, we are mostly agreed on
spaces as some sort of delimiter (except where spacing is ambiguous).

I suppose that to test this one would have to intuit what some given set
comprised, functionally, and test that hypothesis.
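For what it's worth, here is a minimal sketch (in Python) of the sort of
grouping I imagine -- purely my assumption, since the actual algorithm
hasn't been described: two words are grouped if they share enough
(left-neighbor, right-neighbor) contexts in the text.  The threshold is
arbitrary.

```python
from collections import defaultdict

def context_groups(text, min_shared=2):
    """Group words that occur in at least min_shared identical
    (left, right) neighbor contexts.  A crude sketch, not
    Fincher's actual program."""
    words = text.split()
    contexts = defaultdict(set)  # word -> set of (left, right) pairs
    for i in range(1, len(words) - 1):
        contexts[words[i]].add((words[i - 1], words[i + 1]))

    groups = []
    seen = set()
    for w, ctx in contexts.items():
        if w in seen:
            continue
        group = {w}
        for v, ctx2 in contexts.items():
            if v != w and len(ctx & ctx2) >= min_shared:
                group.add(v)
        if len(group) > 1:
            seen |= group
            groups.append(sorted(group))
    return groups
```

On a toy text like "the cat ran . a cat sat . the dog ran . a dog sat ."
this groups cat with dog and ran with sat, which is the kind of
distributional class one would hope for.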

> (ar,or)
> (kor,sor,okor)
> (otchol,tchey)
> (ol,chol,chedy,shedy,qokeey,qokeedy,qokedy)
> (dar,qokaiin,okaiin,qokai!n,okal,qokar,saiin,otar)
> (qokain,otai!n)
> (r,l,sol)
> (tar,ykar)
> (shecthy,olchedy)
> (ched,lkar)

A smallish set might be articles.  A largish set might be forms of the
verb 'to be'.

Incidentally, what is "!"?

This suggests to me that one approach might be to look for numerical
notation or notations.  One assumes that numerical material
would consist of a mixture of common short sequences (1, 2, .., 10, 11,
...) and more or less unique long sequences (365, 1542, $11.59), plus
perhaps some more frequent long sequences (1000, 120).  Presumably the
number of distinct tokens would be consistent with 9 or 10 or with i v x l
c d m.  Certain words designating measuring units (pounds, days, feet)
would occur near numerals.  Things that seemed calendrical or
astronomical would probably include numerals.  Linguistically unusual
patterns, like multiple repetition or randomness, would probably suggest
the presence of numbers.
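The expected frequency profile could be checked mechanically.  A rough
sketch: in numerical material, short tokens (single digits) should recur
constantly while long tokens (years, totals) should be mostly unique.
The length cutoff here is an arbitrary assumption, chosen only for
illustration.

```python
from collections import Counter

def numeral_profile(tokens, short_len=2):
    """Return (avg frequency of short tokens, avg frequency of long
    tokens).  In number-like material the first should dominate."""
    counts = Counter(tokens)
    short = [t for t in counts if len(t) <= short_len]
    long_ = [t for t in counts if len(t) > short_len]
    avg_short = sum(counts[t] for t in short) / len(short) if short else 0.0
    avg_long = sum(counts[t] for t in long_) / len(long_) if long_ else 0.0
    return avg_short, avg_long
```

A digit sequence like "1 2 3 1 2 1 365 1542 1000 1000 2 3" fits the
profile: the short tokens repeat, the long ones mostly do not.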

> ...but no large groups.

Which suggests that the text doesn't represent syntax in the usual way,
if space-delimited items are words.  Much depends on your
algorithm, which you haven't elucidated for us, but intuitively one would
expect to find a mixture of sets of common function words (things that
occur frequently at set positions in a phrase or clause, e.g.,
"(the/a/an/this) NOUN" or "NOUN (and/or) NOUN") and less common sense
words, though the item "god, yahweh" in your Biblical control set
represents a frequent sense-word collocation rather than a function word
set.  As your control case suggests, the algorithm might conflate initials
or medials of several linguistically independent patterns.

So, given the lack of large sets, perhaps we should be (1) looking for a
way of reparsing the material that leads to more typically linguistic
patterns, i.e., not a "space delimited token equals word" mapping, or (2)
looking for a kind of content that leads to similar linguistically
atypical patterns.  Along the second line of approach, what happens if you
use non-linguistic or a-typical linguistic control texts?  I'm not sure
what to suggest.  Verse? (unusual, strained syntax)  Maybe a phone book or
set of class notes?  Field guides?  Horoscopes?  (various kinds of concise
or oblique notations)  What happens if you apply the algorithm to various
transformations of the Bible, e.g., encrypt it with various simple
approaches, or delete word boundaries and reinsert them at random or by
rule?  You'd have to encrypt with something other than a simple letter
substitution, of course.
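The word-boundary transformation is easy to sketch.  Here is one way to
do it (the particular rule -- a break every n characters -- and the
random spacing probability are just assumptions for illustration):

```python
import random

def respace_by_rule(text, n=5):
    """Delete the original word boundaries and reinsert a space
    after every n characters."""
    letters = text.replace(" ", "")
    return " ".join(letters[i:i + n] for i in range(0, len(letters), n))

def respace_at_random(text, p=0.2, seed=0):
    """Delete the original word boundaries and reinsert spaces at
    random with probability p after each character."""
    rng = random.Random(seed)
    letters = text.replace(" ", "")
    out = []
    for ch in letters:
        out.append(ch)
        if rng.random() < p:
            out.append(" ")
    return "".join(out).strip()
```

Running the context-grouping algorithm over such respaced control texts
would show how much of its output depends on the spacing being
linguistically real.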