
Re: VMs: RE: VMs word models --> state machines...?



Hi Ben,

At 15:58 02/03/2005 -0600, Ben Preece wrote:
> Hello.  I've only been a lurker here so far, but may(?) now have
> something relevant to say ... I've kludged up a Java app that uses
> evolution and measures information to find a best-fit FSM for input
> text, and run it on the VMs.  I'll share the app and my "results" with
> anyone who's interested.

Naturally, I'm very interested - but given that the question of how best to go about this is a tricky one, I would be fairly surprised if your algorithms for both evolution and fitting have managed to nail it in one go (I'm sure you feel much the same, given your scare quotes). :-0


Expanding a little: it may seem like you have a decent-sized sample to work with, but you have many different implicit problems to contend with:-
* The whole Currier A/B (and all shades in between) thing
* Neal keys (long gallows or pxxxxxxp sequences, usually on line 1 of paragraphs)
* The first character on a line often seems non-systematic
* The last character on a line has a different distribution (lots of -am words)
* Outer pages of quires seem a bit more arbitrary (perhaps because of corrections)
* Labels and other special text forms seem to function slightly differently


So, if you want to get a reasonably consistent input text corpus to operate on, you need to do a *lot* of non-obvious filtering. Unfortunately, there's currently no special interlinear flag to indicate whether a page is on the outside of a quire (I've been meaning to suggest this to Jorge Stolfi for a while), but the list of page names to filter out there should be fairly obvious.

All in all, I'd suggest filtering out long gallows, Neal keys, line-initial characters, line-terminal -am characters, as well as the outside pages of quires: and sticking to a filtered version of either Herbal A, Herbal B, or the balneological section.
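A rough sketch of that kind of filtering, purely to make the idea concrete. Everything here is an assumption rather than anyone's actual tool: the input is taken to be a dict from page name to a list of lines, each line a list of EVA words; `filter_lines` and `skip_pages` are made-up names; "line-terminal -am" is read as "drop a line-final word ending in am"; and Neal keys / long gallows are not handled at all, since they need a definition of their own.

```python
def filter_lines(pages, skip_pages=frozenset()):
    """Return a flat list of filtered lines (each a list of EVA words).

    pages      -- dict: page name -> list of lines, each line a list of words
    skip_pages -- page names to drop entirely (e.g. outer pages of quires);
                  the caller has to supply this list, it is not derived here
    """
    kept = []
    for page, lines in pages.items():
        if page in skip_pages:
            continue
        for line in lines:
            words = list(line)
            if not words:
                continue
            # Drop the line-initial character (often non-systematic).
            words[0] = words[0][1:]
            # Drop a line-final word ending in -am (assumed reading of the rule).
            if words[-1].endswith("am"):
                words.pop()
            # Discard any word emptied by the steps above.
            words = [w for w in words if w]
            if words:
                kept.append(words)
    return kept


# Hypothetical page names and a made-up line, for illustration only.
sample = {
    "page_x": [["fachys", "ykal", "ar", "dam"]],
    "page_y": [["daiin"]],
}
filtered = filter_lines(sample, skip_pages={"page_y"})
```

Restricting the result to one section (Herbal A, Herbal B, or balneological) would then just be a matter of which pages are passed in.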

But even with the input text normalized in this kind of way, you still have a bit of work to do to get properly started. As Rene pointed out, choosing the transcription to start from remains problematic: I'd suggest a reasonable starting point would be to convert all cXh / ch / sh / iiii / iii / ii / eee / ee / qo sequences into new tokens... but that's only a start. As you probably know, I'm a proponent of verbose ciphers (dy / ol / or / al / ar / am etc as single tokens) as being a likely part of the VMs' cipher system, but it's not clear to me how a best-fit FSM search might best draw these out - perhaps you've already thought about this issue.
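To make the token conversion concrete, here is a minimal sketch of a greedy longest-match tokenizer. It assumes EVA transcription, and it assumes "cXh" means c + gallows + h (cth / ckh / cph / cfh); the names `BASE_TOKENS`, `VERBOSE_TOKENS` and `tokenize` are invented for this example. Greedy left-to-right matching is only one way to resolve overlaps, and a different rule could well give different FSMs.

```python
# Multi-character sequences to treat as single tokens (assumed EVA spellings).
BASE_TOKENS = ["cth", "ckh", "cph", "cfh", "ch", "sh",
               "iiii", "iii", "ii", "eee", "ee", "qo"]
# Optional verbose-cipher pairs, kept separate so they can be switched on or off.
VERBOSE_TOKENS = ["dy", "ol", "or", "al", "ar", "am"]


def tokenize(word, multi=BASE_TOKENS):
    """Split one word into tokens, preferring the longest match at each position."""
    ordered = sorted(multi, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for t in ordered:
            if word.startswith(t, i):
                tokens.append(t)
                i += len(t)
                break
        else:
            # No multi-character token starts here: emit a single character.
            tokens.append(word[i])
            i += 1
    return tokens


plain = tokenize("qokeedy")                                  # base tokens only
verbose = tokenize("qokeedy", BASE_TOKENS + VERBOSE_TOKENS)  # with verbose pairs
```

Running the FSM search once on each tokenization would give a direct way to see whether the verbose pairs earn their keep as single tokens.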

Even given this plethora of caveats, it would be interesting to see what differences in FSM your code produces between (say) normalized Herbal A and Herbal B corpora.

BTW, are your FSMs memoryless? Does your code look for hidden states? etc etc (cont. p.94 in CS 1.0.1 textbook)

Cheers, .....Nick Pelling.....

