Re: On the word length distribution
> Jorge, do I understand correctly that you used the Currier
> alphabet 'as is'?
Actually the alphabet I used represented <Ch>, <Sh>, <CTh>, etc. as
single characters, but each <e> and <i> as an isolated character.
(Thus it is not strictly Currier's --- although he had a code
for <i>, he would use compound codes for <iin> etc.)
To clarify such matters, I have added to the page links to the
relevant files showing the list of words and the factorization
into "letters" which I used.
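
A minimal sketch of such a factorization, for readers who want to experiment. It assumes words are written in lowercase EVA-style spelling and that the compound glyphs are exactly the six listed in `COMPOUNDS`. Both are assumptions made for illustration; the linked files remain the authoritative record of the factorization actually used. The name `factor` is hypothetical.

```python
import re

# Assumed set of compound glyphs treated as single letters
# (lowercase spellings of <CTh>, <CKh>, <CPh>, <CFh>, <Ch>, <Sh>).
COMPOUNDS = ["cth", "ckh", "cph", "cfh", "ch", "sh"]

# Regex alternation tries alternatives left to right, so the compounds
# are listed before the single-character fallback ".".
_LETTER = re.compile("|".join(COMPOUNDS) + "|.")

def factor(word):
    """Split a word into 'letters'; every <e> and <i> stays isolated."""
    return _LETTER.findall(word)

print(factor("chedy"))   # compound <Ch> kept as one letter
print(factor("daiin"))   # each <i> is a separate letter
```

Under this rule the length of a word is simply `len(factor(word))`.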
> This is of course a bit of a problem since it
> seems reasonable to assume that the strings of C's should
> perhaps not represent multiple characters (or,
> conversely, the strings of I's should).
Yes. I did the same computation in terms of the elements of my word paradigm
(where single <e>s are attached to the preceding letter, double
and triple <e>s are counted as single letters, strings of <i> are attached to
the final letter). Alas, the resulting length distribution is not symmetric,
and does not fit a binomial distribution with any integer N.
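
A rough sketch of how one might check a length histogram against binomial distributions with integer N. It assumes lengths have been shifted so that the shortest possible word has length 0, matches the mean to set p = mean/N, and scores each N by the sum of squared differences between observed and predicted frequencies. The shift, the error measure, the search range and the name `fit_binomial` are all assumptions for illustration, not the procedure behind the results reported here.

```python
from math import comb

def binomial_pmf(n, p):
    """Probabilities of k = 0..n successes in n trials."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def fit_binomial(hist, max_n=20):
    """hist maps a shifted word length k >= 0 to its count.

    Returns (N, p, squared_error) for the best integer N <= max_n.
    """
    total = sum(hist.values())
    mean = sum(k * c for k, c in hist.items()) / total
    best = None
    # N must be at least the largest observed length.
    for n in range(max(max(hist), 1), max_n + 1):
        p = mean / n                      # match the observed mean
        pmf = binomial_pmf(n, p)
        err = sum((hist.get(k, 0) / total - pmf[k]) ** 2
                  for k in range(n + 1))
        if best is None or err < best[2]:
            best = (n, p, err)
    return best
```

A symmetric histogram is only compatible with p = 1/2, so a visibly skewed histogram already rules out that case before any fitting is done.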
> And the peculiar role of word-initial 4 is problematic too. Not
> by itself, but due to the fact that it is essentially always
> followed by an O.
There are all sorts of funny "phonological" rules going on. I am
looking at the set of all words with k letters, for various k, to see
whether there is any pattern that would explain the binomial shape. So
far it is a complete mystery. For instance, here are the 2-letter
words, with their occurrence counts in the (cleaned-up) text:
 23 {k}{y}
  6 {k}{o}
  2 {k}{l}
  1 {k}{Sh}
  1 {k}{a}
  1 {k}{e}

115 {CTh}{y}   40 {CKh}{y}   14 {CPh}{y}    6 {CFh}{y}
 16 {CTh}{o}    5 {CKh}{o}    2 {CPh}{o}
  1 {CTh}{a}
  1 {CTh}{s}    1 {CKh}{s}                  1 {CFh}{s}
  1 {CTh}{d}
  1 {CTh}{l}

  1 {Ch}{k}
  3 {Ch}{CKh}
  4 {Ch}{CTh}
  1 {Ch}{e}      27 {Sh}{e}
152 {Ch}{y}     102 {Sh}{y}
 68 {Ch}{o}     126 {Sh}{o}
  1 {Ch}{a}       3 {Sh}{a}
  6 {Ch}{d}       6 {Sh}{d}
 26 {Ch}{l}       3 {Sh}{l}
  9 {Ch}{r}       2 {Sh}{r}
 16 {Ch}{s}       3 {Sh}{s}

 21 {o}{m}       87 {a}{m}
                  4 {a}{n}
  7 {o}{d}
548 {o}{l}      270 {a}{l}
365 {o}{r}      360 {a}{r}
 25 {o}{s}        1 {a}{s}
  6 {o}{y}        1 {a}{y}
 12 {o}{t}
  5 {o}{k}
  2 {o}{p}
  1 {o}{f}
  1 {o}{CKh}
  1 {o}{Sh}

278 {d}{y}       13 {l}{y}
 14 {d}{o}       15 {l}{o}
  6 {d}{a}
 20 {d}{l}
  1 {d}{d}        4 {l}{d}
  1 {d}{r}       11 {l}{r}
  2 {d}{s}       10 {l}{s}
  4 {d}{m}        1 {l}{m}
                  1 {l}{Ch}
                  1 {l}{Sh}
                  1 {l}{e}
                  1 {l}{k}
                  1 {l}{t}

  1 {e}{l}
  1 {e}{s}
  1 {e}{y}

I can see the well-known partition of the alphabet into classes
(gallows, dealers, circles, benches), but obviously that is
only part of the story.
Note that words with low occurrence counts may be parts of larger
words that were incorrectly transcribed as isolated words.
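
For anyone wishing to repeat this kind of tabulation for other values of k, here is a minimal sketch. It assumes the text is available as a list of word tokens in lowercase EVA-style spelling and uses the same assumed set of compound glyphs as single letters. The function name `k_letter_words` is hypothetical, and the output keys are lowercase ({ch} rather than {Ch}).

```python
import re
from collections import Counter

# Assumed compound glyphs, tried before the single-character fallback.
_LETTER = re.compile("cth|ckh|cph|cfh|ch|sh|.")

def k_letter_words(tokens, k):
    """Count occurrences of each word that factors into exactly k letters."""
    counts = Counter()
    for tok in tokens:
        letters = _LETTER.findall(tok)
        if len(letters) == k:
            # Write the word with braces around each letter, e.g. {o}{l}.
            counts["".join("{%s}" % x for x in letters)] += 1
    return counts

# Toy input, not real counts from the text.
sample = ["ol", "ol", "ar", "chy", "daiin", "shy"]
for word, n in k_letter_words(sample, 2).most_common():
    print(n, word)
```

Singletons in the output deserve the same caution as above, since a spurious word break in the transcription produces exactly such entries.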
> If the close fit vanishes if one uses the FSG
> alphabet (ignore Eva in this context), then the 'coincidence
> option' gains some ground.
I don't think so. The alphabet I used is defined by a very simple
rule: "a letter is a connected set of strokes". All other alphabets
are based on the assumption that frequent glyph combinations are
single letters. Perhaps that is simply not true...
All the best,
--stolfi