[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: VMs: f85r2 "four ages" diagram

--- Bruce Grant <bgrant@xxxxxxxxxxxxx> wrote:
> This sounds interesting ... could you expand on it a
> little? What sort 
> of algorithms have you tried?

Basically, I've tried two main objective functions and
two main algorithms, with many variations of each.

Objective Functions

** Collocation Using Likelihood Ratios
I took this idea originally from an application to
word collocations and applied it to letter
collocation. Collocation simply meaning that when two
items appear together, they do so on purpose - they
appear more often together than if they were
distributed randomly to a statistically significant
degree. A likelihood ratio is a bit more statistically
robust than a t test or chi-square, but the idea is
essentially the same - calculate the frequency two
items appear together and ratio with the null
hypothesis (how often they should appear together if
their distributions were random). The exact formula
for the log of the likelihood ratio is:

log L(c12, c1, p) + log L(c2 - c12, N - c1, p) - log
L(c12, c1, p1) - log L(c2 - c12, N - c1, p2)

where L(k,n,x) = (x^k)(1-x)^(n-k)
with p = c2/N, p1 = c12/c1 and p2 = (c2 - c12)/(N -
c1) and c1, c2, and c12 are the counts of the first
letter (or word), the second and the two appearing
together. N is the number of all combinations (bigrams
essentially). For reference, I use "Foundations of
Statistical Natural Language Processing" by Manning &
Schutze - that has the whole derivation.

** Minimum Description Length
This simply states that the length of the lexicon plus
the length of the message using it will be a minimum.
For example, if I have the string "andandand", the MDL

Lexicon - 1 = "and" (length = 3)
Message - 1, 1, 1 (length = 3)
MDL = 3+3 = 6

Consider that versus:
Lexicon - 1 = "andandand" (length = 9)
Message - 1 (length = 1)
MDL = 9+1 = 10

Lexicon - 1 = "a" (length = 1)
2 = "n" (length = 1)
3 = "d" (length = 1)
Message - 1, 2, 3, 1, 2, 3, 1, 2, 3 (length = 9)
MDL = (1+1+1)+9 = 12

That's a pretty contrived example, but you get the

These had the most variance, but basically fell into
two groups.

** Aggregation
Mostly used with the Likelihood ratios.

Simplest - find the highest likelihood ratio bigram
(that is still significant) and combine into one new
symbol. Re-evaluate to find the new highest likelihood
bigram (which includes the symbol from the previous
step as a single gram). Keep going until nothing
statistically significant is left. Problem... just
about everything in a language is statistically
significant! It's hard for the algorithm to know when
to stop. Also, some word collocations are more
statistically signficant than letter collocations, so
words would bind together into phrases before other
words were finished. I couldn't find any telling
factor to let the algoritm know when this was
happening. Finally, the collocation statistics are
global while the individual letter relationships are
local. For instance, in English, "th" stands out far
and away as a collocation of letters (really should be
its own letter since it has its own phoneme) but...
not all instances where t is followed by h should be
joined - "THe boaT House".

Which brought a variation to this algorithm trying to
consider the effects of collocations relative to one
another. So, in "the boat house", I looked at not only
how strong the "t h" collocation was, but how strong
the "at" and "ho" were in relation. This helped, so I
expanded it to include "o#t" and "a#h" and "t#o" and
"h#u" relations (where # means any character). I took
this out to the longest significant relation. This
helped some more but... it never differntiated enough
to give a better than 1:1 signal to noise ratio (for
every real word it came back with, it came back with a
word or phrase fragment).

I tried some MDL based aggregation methods, but they
never got a good hold on what to start forming into

** Monte Carlo/Genetic
Basically, for every position in the input text, find
every recurring string of each length (so for
"abcabc", i would find "a", "ab" and "abc" at the
first position). These then formed the possible
choices at each location. Random parses ('genes') were
created using this and the objective function (usually
MDL in this case, but I did do some likelihood ratios)
evaluated. The pool of the more successful solutions
formed the basis of parses in the next set of
attempts. Exact implementations varied a lot in this
respect trying to tune the algorithm. Also, I tried
different functions weighting the statistical
probability of a choice being made at each stage in
generating a parse (generally, the more frequently a
parse appeared overall, the more I favored it).

Anyways, these attempts also suffered at not knowing
when to stop - in this case far too soon. The
algorithm would usually find a local minimum too
quickly and quit or just wander and wander randomly
and not find any good path.

I never tried any of this on Voynichese - I spent all
my time parsing in English and Latin. The idea was to
find a good language independent word segmentation
algorithm. I never got it to work well enough on known
languages to run it against the VMS and provide any
significant result.

Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list