[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
VMs: Best-fit 2-state PFSMs
While playing with FSMs, I noticed that a best-fit (minimal information) 2-state PFSM does a decent job of distinguishing vowels from consonants.
A 2-state FSM categorizes symbols by the transitions that happen on their input. If the states are labeled 0 and 1, then there are four cases: {0->0, 1->0}, {0->0, 1->1}, {0->1, 1->0}, and {0->1, 1->1}. We can abbreviate these {0,0}, {0,1}, {1,0}, and {1,1}. If it's also possible that a symbol may never occur in some states, then there are four other cases: {-,0}, {-,1}, {0,-}, and {1,-}.
Here's some examples of best-fit 2-state PFSMs.
English (Anne of Green Gables)
3.933 bits/character
{0,0} : B C D F G H J K L M P R S T V W Y
{0,1} : N
{1,0} :
{1,1} : A E I O U Q
{0,-} : X
Latin (Mosella)
3.870 bits/character
{0,0} : A E I O U Y
{0,1} : N
{1,0} :
{1,1} : B C D F G H L M P Q R S T
{1,-} : X
NOTE: In this text, 'i' and 'u' were used as both consonants and a vowels. The letters 'j' and 'v' did not appear.
Turkish (I think - http://www.ecst.csuchico.edu/~nazan/opamuk/article/oktay.html)
3.686 bits/character
{0,0} : A E I O Ö U Ü
{0,1} : X
{1,0} :
{1,1} : B C D F G H J K L M N P Q R S T V Y Z
{1,-} : W
Russian (Bednaya Liza)
4.133 bits/character
{0,0} : b v g d zh z k l m n p r s t kh ts ch sh
{0,1} :
{1,0} :
{1,1} : a ye yo i o e' yu ya
{-,0} : shch
{0,-} : '
{-,1} : i'
{1,-} : u y j
NOTE: The PFSM used Cyrillic characters, not Roman; I transcribed this by hand according to http://erc.lib.umn.edu/cyr-trans.html. The text was a bit short, only 13,497 characters. A larger text might show fewer characters that only occur in a single state. Presumably, 'u' and 'y' might end up with the other vowels, and 'shch' might end up with the other consonants.
I've gotten similar results with Spanish, Italian, French, Vietnamese, Romanized Chinese, and Indonesian. I haven't yet looked at consonant-only languages like Arabic and Hebrew, or syllabic languages like Japanese Kana.
I'll follow up with something about the VMs.
-Ben
______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list