
RE: VMs: RE: RE: Best-fit 2-state PFSMs



Nick:
Quick implementation hack for the interested: for a given corpus, build
a square matrix containing the letter transition counts (i.e. a
matrix[x,y] recording how many times letter x is followed by letter y
within that corpus), and work from that instead. Much easier! :-)

Me:
Just pointing out that for a text over an alphabet of N characters, a
digraph adjacency model is really an N-state PFSM, a trigraph adjacency
model is really an N^2-state PFSM, and so on.
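
In rough Python terms (just a sketch, not code from my actual runs; the
function name is made up):

    from collections import defaultdict

    def digraph_pfsm(corpus):
        # matrix[x][y] = how many times letter x is followed by letter y
        counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(corpus, corpus[1:]):
            counts[a][b] += 1
        # normalize each row into a probability distribution
        pfsm = {}
        for a, row in counts.items():
            total = sum(row.values())
            pfsm[a] = {b: n / total for b, n in row.items()}
        return pfsm

Each row of pfsm is one state's outgoing distribution - the state is just
"the previous letter" - which is why the digraph model has N states and
the trigraph model has N^2.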

Nick:
This is presumably the "evolution" stage you mentioned before, right?
How do you seed it?

Me:
I've tried it two ways - setting everything to zero, and setting
everything to random values.  The former subjectively seems to have less
range than the latter - there seems to be some kind of limit to how far
the evolution will wander before settling into a local minimum.  For
larger PFSMs, I would do several runs with different random initial
values.
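
For concreteness, the two seedings look something like this (evolve() and
score() just stand in for whatever local-search step and fitness measure
you plug in - placeholders, not real code from my runs):

    import random

    def seed_zero(n_states, n_symbols):
        # every transition weight starts at zero
        return [[0.0] * n_symbols for _ in range(n_states)]

    def seed_random(n_states, n_symbols, rng):
        # every transition weight starts at a random value
        return [[rng.random() for _ in range(n_symbols)] for _ in range(n_states)]

    def best_of_runs(n_runs, n_states, n_symbols, evolve, score):
        # several runs from different random starts; keep whichever settles lowest
        best = None
        for run in range(n_runs):
            rng = random.Random(run)
            candidate = evolve(seed_random(n_states, n_symbols, rng))
            if best is None or score(candidate) < score(best):
                best = candidate
        return best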

Nick:
Perhaps the best approach might simply be to compare the curves of (best
fit PFSM's information content) vs (number of states in the PFSM) for
different transcriptions? I'd predict that the best transcription should
show a sharp drop in information content once a critical number of
states is included... just a thought! :-o

Me:
I'd hoped that this would be true, but I never seemed to see it in real
languages (I mostly tried Latin), only in small toy languages that I
invented just for testing.  Real languages apparently can't be
realistically modeled without very, very large FSMs, so if a drop like
this occurs at all, it may not happen until the FSM has hundreds of states.
The largest I've ever tried to generate was 120 states.
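
If anyone wants to try Nick's curves, one way to pin down "information
content" is the average bits per character of the text under the best-fit
PFSM at each state count - something along these lines (fit_pfsm() is a
stand-in for whatever best-fit search you use, not code I've actually run):

    import math

    def bits_per_char(text, pfsm, start_state):
        # pfsm[state][symbol] -> (probability, next state)
        state, total = start_state, 0.0
        for ch in text:
            p, state = pfsm[state][ch]
            total += -math.log2(p)
        return total / len(text)

    def info_curve(text, max_states, fit_pfsm):
        points = []
        for n in range(1, max_states + 1):
            pfsm, start = fit_pfsm(text, n)   # best-fit n-state PFSM
            points.append((n, bits_per_char(text, pfsm, start)))
        return points

Plotting those points for each transcription side by side would show (or
fail to show) the sharp drop Nick predicts.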

-Ben