Re: VMs: Re: Word distribution
On Saturday 06 Mar 2004 12:58 pm, Nick Pelling wrote:
> You must be extremely careful when interpreting rank frequency law graphs:
> what they're claiming is that, if you rank all the words in a text by their
> frequency, then their frequencies will generally tail off in a certain
> (logarithmically straight-line) way.
It is a power law (i.e. a straight line in a double-logarithmic plot) with an
exponent very close to -1.0.
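To make that concrete, here is a small sketch (the corpus and word names are made up for illustration) that ranks words by frequency and fits the slope of log(frequency) against log(rank) by least squares; for a Zipfian text the fitted slope comes out close to -1.0:

```python
import math
from collections import Counter

def zipf_slope(words):
    """Fit the slope of log(frequency) vs log(rank) by least squares.

    For a text obeying Zipf's law, the slope is close to -1.0.
    """
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy corpus drawn from an exact Zipfian distribution: word i appears
# round(1000 / i) times, so the fitted slope should be near -1.
corpus = [f"w{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
slope = zipf_slope(corpus)
print(round(slope, 2))
```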
> However, the same is
> also broadly true of random texts (as Gabriel mentions and we know the VMs
> is, in many ways, more structured than random). It is therefore
> problematic to draw conclusions (especially as to "languageness") from
> this. You must similarly be careful when interpreting number frequency
> law graphs.
I believe that the reasons for Zipf's law in random texts have little to do
with the case of natural languages. Reading Wentian Li's paper(s), you will
immediately realise that in random texts (where the space is just another
character), the probability of finding *very* long words decays much more
slowly than in natural languages. However, ridiculously long words do not
occur in natural languages (OK, Jacques, I am prepared for some examples :-)
Shall I say "unlikely" instead?).
If one generates a random text and looks at the word and token length
distributions, these are *very* different from those of a natural language.
See figure 1b in the preprint of my Cryptologia paper, the curve labelled
"Forced single spaces" (this is a random text with the same space proportion
as the VMs).
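As a rough illustration of that difference (a toy sketch, not the exact construction used in the paper): if every character, including the space, is drawn uniformly at random, word lengths follow a geometric distribution, so the tail decays slowly and absurdly long "words" turn up even in a modest sample:

```python
import random
from collections import Counter

random.seed(0)

# Random text where the space is just another character: 26 letters
# plus space, all equally likely (space probability 1/27).
alphabet = "abcdefghijklmnopqrstuvwxyz "
text = "".join(random.choice(alphabet) for _ in range(200_000))

lengths = Counter(len(w) for w in text.split())
mean = sum(k * n for k, n in lengths.items()) / sum(lengths.values())

# Word lengths are geometric: P(len = k) = (26/27)**(k-1) * (1/27),
# giving a mean length near 27 and a very heavy tail, unlike any
# natural language.
print(round(mean, 1), max(lengths))
```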
> What are Zipf's Laws all about in natural language? FWIW, I believe they
> reflect three different kinds of mechanisms, which have different
> (overlapping) degrees of usefulness (and hence frequencies):
> (1) syntactic infrastructure (words like "the", "and" etc);
> (2) global relevance (signifiers reused globally to explain/describe
> different things); and
> (3) local relevance (signifiers reused locally in a narrative to provide
> dramatic structure).
Yes, there has been some debate about this and about how to draw those
limits. Andras Kornai has published a very interesting paper on this precise
problem (the mid-range words); it is available on his website.
> The good thing about Zipf's Laws is that they allow a kind of comparison
> between radically different texts: but the bad thing about them is they
> don't tell you about actual instance count per se, because those kinds of
> things are (for the most part) abstracted out as part of the process.
Note that one can also measure "distances" between Zipf's ranks (see
Havlin's paper, referenced on my page on Zipf's laws). Although the plot
shows rank and frequency, you know which word has which rank, and so a
comparison between ranks is possible.
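As a sketch of what such a comparison could look like (a simple illustrative distance, not necessarily Havlin's actual definition; the toy texts are made up): map each word to its frequency rank in each text and average the absolute rank differences over the shared vocabulary:

```python
from collections import Counter

def ranks(words):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(words)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {w: r for r, w in enumerate(ordered, start=1)}

def rank_distance(text_a, text_b):
    """Average absolute rank difference over the shared vocabulary.

    An illustrative measure only: it shows that, because we know which
    word holds which rank, ranks can be compared across texts.
    """
    ra, rb = ranks(text_a), ranks(text_b)
    shared = set(ra) & set(rb)
    if not shared:
        return float("inf")
    return sum(abs(ra[w] - rb[w]) for w in shared) / len(shared)

a = "the cat sat on the mat the cat".split()
b = "on the dog on the log on dog".split()
print(rank_distance(a, b))  # shared words: "the" and "on"
```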
Despite all this, one has to keep in mind that one may end up with a Zipf
distribution for a reason other than the VMs being meaningful or structured
in a "language-like" fashion. So while it should be noted, it is not a proof
of "languageness", as you said.
Something that I am quite uneasy about is that we should expect to find some
grammatical constructs, but this search has not been very successful (or it
has not been very thorough, I am not sure which).
> I stand by my assertion (though it chimes with my own experience, I don't
> believe I originated it?) that the instance count of Voynichese words seems
> generally low compared with natural languages: and I also don't believe
> that Zipf's Laws are the right way to test this assertion.
If you think a bit more about this, you will realise that the number of
different words in a corpus which follows Zipf's law is approximately the
number expected for that particular corpus size. In other words, if it
follows Zipf's law, then the relative frequencies and the lexicon size are
more or less what you would expect in other natural languages.
As Rene pointed out, if a language follows Zipf's law then the increase of
lexicon size with corpus size follows a particular pattern (which I seem to
remember is also a power law, but I would appreciate being corrected if that
is not the case).
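It is indeed roughly a power law, V(n) proportional to n**beta with beta below 1 (Heaps' law). A toy sketch (the vocabulary size and corpus sizes are arbitrary choices, and with a finite vocabulary the growth also saturates, so the fitted exponent here is only indicative): sample words from a P(rank r) proportional to 1/r distribution and watch how the number of distinct words grows with corpus size:

```python
import math
import random

random.seed(1)

# Draw a corpus from a Zipfian distribution P(rank r) proportional to
# 1/r over a finite potential vocabulary (sizes are arbitrary here).
V = 5000
weights = [1.0 / r for r in range(1, V + 1)]
stream = random.choices(range(V), weights=weights, k=100_000)

# Track lexicon size (distinct words seen) at a few corpus sizes.
seen = set()
checkpoints = {1_000: 0, 10_000: 0, 100_000: 0}
for n, w in enumerate(stream, start=1):
    seen.add(w)
    if n in checkpoints:
        checkpoints[n] = len(seen)

# Fit the growth exponent beta in V(n) ~ n**beta between the first and
# last checkpoints; sublinear growth gives 0 < beta < 1 (Heaps' law).
beta = math.log(checkpoints[100_000] / checkpoints[1_000]) / math.log(100)
print(checkpoints, round(beta, 2))
```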
Cheers,
Gabriel
______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list