
To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: f85r2 "four ages" diagram
From: Eric <mynumberis2000@xxxxxxxxx>
Date: Mon, 12 Jul 2004 20:21:02 -0700 (PDT)
In-reply-to: <40F04D6D.4020906@mail.msen.com>
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx

--- Bruce Grant <bgrant@xxxxxxxxxxxxx> wrote:
> This sounds interesting ... could you expand on it a little?
> What sort of algorithms have you tried?

Basically, I've tried two main objective functions and two main algorithms, with many variations of each.

Objective Functions
-------------------

** Collocation Using Likelihood Ratios

I took this idea originally from an application to word collocations and applied it to letter collocations. Collocation simply means that when two items appear together, they do so on purpose: they appear together more often, to a statistically significant degree, than they would if they were distributed randomly. A likelihood ratio is a bit more statistically robust than a t-test or chi-square, but the idea is essentially the same - calculate how often two items appear together and compare against the null hypothesis (how often they should appear together if their distributions were random).

The exact formula for the log of the likelihood ratio is:

  log L(c12, c1, p) + log L(c2 - c12, N - c1, p)
    - log L(c12, c1, p1) - log L(c2 - c12, N - c1, p2)

where

  L(k, n, x) = (x^k) * (1 - x)^(n - k)

with p = c2/N, p1 = c12/c1 and p2 = (c2 - c12)/(N - c1). Here c1, c2, and c12 are the counts of the first letter (or word), the second, and the two appearing together; N is the total number of combinations (bigrams, essentially). For reference, I use "Foundations of Statistical Natural Language Processing" by Manning & Schutze - that has the whole derivation.

** Minimum Description Length

This simply states that the length of the lexicon plus the length of the message encoded with it should be a minimum.
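As a sketch, the ratio above can be computed directly (the function names here are my own illustration, not anyone's library; note that when c12 exactly equals its expected value under independence, the log ratio comes out to zero):

```python
import math

def log_l(k, n, x):
    """log of L(k, n, x) = (x^k) * (1 - x)^(n - k), guarding the x = 0, 1 edges."""
    if x == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if x == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(x) + (n - k) * math.log(1.0 - x)

def log_likelihood_ratio(c1, c2, c12, N):
    """Log likelihood ratio for the pair, per the formula above.
    c1, c2 = counts of each item; c12 = count of the pair; N = total bigrams."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    return (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
```

With toy counts c1 = c2 = 10 and N = 100, a pair count of c12 = 1 is exactly what independence predicts and scores 0, while c12 = 8 scores a large negative log ratio (the usual test statistic is -2 times this).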
For example, if I have the string "andandand", the candidate descriptions are:

  Lexicon - 1 = "and" (length = 3)
  Message - 1, 1, 1 (length = 3)
  MDL = 3 + 3 = 6

Consider that versus:

  Lexicon - 1 = "andandand" (length = 9)
  Message - 1 (length = 1)
  MDL = 9 + 1 = 10

or:

  Lexicon - 1 = "a" (length = 1)
            2 = "n" (length = 1)
            3 = "d" (length = 1)
  Message - 1, 2, 3, 1, 2, 3, 1, 2, 3 (length = 9)
  MDL = (1 + 1 + 1) + 9 = 12

That's a pretty contrived example, but you get the idea.

Algorithms
----------

These had the most variance, but basically fell into two groups.

** Aggregation

Mostly used with the likelihood ratios. Simplest version: find the highest-likelihood-ratio bigram (that is still significant) and combine it into one new symbol. Re-evaluate to find the new highest-likelihood bigram (treating the symbol from the previous step as a single gram). Keep going until nothing statistically significant is left.

Problem... just about everything in a language is statistically significant! It's hard for the algorithm to know when to stop. Also, some word collocations are more statistically significant than letter collocations, so words would bind together into phrases before other words were finished forming. I couldn't find any telling factor to let the algorithm know when this was happening.

Finally, the collocation statistics are global while the individual letter relationships are local. For instance, in English, "th" stands out far and away as a collocation of letters (it really should be its own letter, since it has its own phoneme), but not all instances where t is followed by h should be joined - "THe boaT House". Which brought a variation to this algorithm, trying to consider the effects of collocations relative to one another. So, in "the boat house", I looked at not only how strong the "t h" collocation was, but how strong "at" and "ho" were in relation. This helped, so I expanded it to include the "o#t", "a#h", "t#o" and "h#u" relations (where # means any character).
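The scoring in the "andandand" example fits in a few lines (a sketch of my own; it charges each distinct lexicon entry its length and each message slot one unit, exactly as in the arithmetic above):

```python
def mdl(tokens):
    """Description length of a parse: total length of the distinct
    words in the lexicon, plus one lexicon reference per token."""
    lexicon = set(tokens)
    return sum(len(w) for w in lexicon) + len(tokens)

# the three candidate parses of "andandand" from the example
print(mdl(["and", "and", "and"]))   # 3 + 3 = 6
print(mdl(["andandand"]))           # 9 + 1 = 10
print(mdl(list("andandand")))       # (1 + 1 + 1) + 9 = 12
```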
I took this out to the longest significant relation. This helped some more, but it never differentiated enough to give better than a 1:1 signal-to-noise ratio (for every real word it came back with, it also came back with a word or phrase fragment). I tried some MDL-based aggregation methods, but they never got a good hold on what to start forming into words.

** Monte Carlo/Genetic

Basically, for every position in the input text, find every recurring string of each length (so for "abcabc", I would find "a", "ab" and "abc" at the first position). These then formed the possible choices at each location. Random parses ('genes') were created using this, and the objective function (usually MDL in this case, but I did do some likelihood ratios) evaluated them. The pool of the more successful solutions formed the basis of the parses in the next set of attempts. Exact implementations varied a lot in this respect, trying to tune the algorithm. I also tried different functions weighting the statistical probability of a choice being made at each stage in generating a parse (generally, the more frequently a parse appeared overall, the more I favored it).

Anyway, these attempts also suffered from not knowing when to stop - in this case, stopping far too soon. The algorithm would usually find a local minimum too quickly and quit, or just wander and wander randomly and never find any good path.

I never tried any of this on Voynichese - I spent all my time parsing English and Latin. The idea was to find a good language-independent word segmentation algorithm. I never got it to work well enough on known languages to run it against the VMS and provide any significant result.
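The recurring-string enumeration plus random parsing described above can be sketched roughly as follows (my own bare-bones illustration: plain random restarts stand in for the full genetic pool, and `str.count` undercounts overlapping repeats, which is good enough for a sketch):

```python
import random

def recurring_choices(text):
    """For each position, the substrings starting there that recur in the
    text (for "abcabc", position 0 yields "a", "ab", "abc").  Falls back
    to the single character so every position has at least one choice."""
    n = len(text)
    choices = []
    for i in range(n):
        opts = [text[i:j] for j in range(i + 1, n + 1)
                if text.count(text[i:j]) >= 2]  # non-overlapping count
        choices.append(opts or [text[i:i + 1]])
    return choices

def random_parse(text, choices, rng):
    """One random 'gene': pick a recurring string at each position reached."""
    tokens, i = [], 0
    while i < len(text):
        tok = rng.choice(choices[i])
        tokens.append(tok)
        i += len(tok)
    return tokens

def mdl(tokens):
    """Objective function: lexicon length plus message length."""
    return sum(len(w) for w in set(tokens)) + len(tokens)

def monte_carlo_segment(text, trials=500, seed=0):
    """Keep the lowest-MDL parse over many random attempts."""
    rng = random.Random(seed)
    choices = recurring_choices(text)
    return min((random_parse(text, choices, rng) for _ in range(trials)),
               key=mdl)
```

On "andandand" this search reliably lands on the parse "and" / "and" / "and" with MDL 6, matching the worked example earlier.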
