VMs: Do 2-state PFSMs distinguish vowels/consonants in VMs?
Here's what I've been working on writing up. Sorry it's so long. Hopefully, it'll be clear to somebody.
The question I've asked myself is whether the 2-state PFSMs for various transcriptions of the VMs distinguish vowels from consonants. The method works pretty well for the known languages that I've tried, but when you don't know the language, how would you know whether it's working or not?
Well, something I've noticed about known languages is that when you classify symbols according to their state transitions in PFSMs, as I've been doing, the set of vowels tends to show up as a well-defined group not just in the 2-state PFSM, but in the 3- and 4-state PFSMs as well.
For example, here are the 2, 3, and 4-state PFSMs for English. On the left are the from-states; along the top are the input tokens; the table gives the mapping to the next state.
English, 2 States
3.93 bits/char
  | a b c d e f g h i j k l m n o p q r s t u v w x y z
--+-----------------------------------------------------
0 | 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 1 1 0 1 1
1 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 - 1 1
English, 3 States
3.85 bits/char
  | a b c d e f g h i j k l m n o p q r s t u v w x y z
--+-----------------------------------------------------
0 | 0 2 1 1 0 1 1 2 0 2 2 2 2 0 0 1 2 1 1 1 0 2 1 0 1 2
1 | 0 2 1 1 0 2 2 2 0 - 1 2 2 1 0 2 2 2 1 1 0 2 2 - 1 1
2 | 0 2 2 1 0 1 0 2 0 2 1 1 2 2 0 1 2 2 1 1 0 2 2 - 1 0
English, 4 States
3.57 bits/char
  | a b c d e f g h i j k l m n o p q r s t u v w x y z
--+-----------------------------------------------------
0 | 0 1 3 3 0 3 3 3 0 2 3 0 3 0 0 3 2 3 3 3 0 2 3 1 3 2
1 | 0 2 2 2 0 2 2 2 0 2 1 2 2 2 0 2 2 2 1 2 0 2 2 - 2 2
2 | 0 - - - 0 2 - 2 0 - - 2 3 2 0 - - 2 3 3 0 - 2 - 3 3
3 | 0 2 2 3 0 2 2 3 0 - 3 3 2 3 0 2 2 2 3 3 0 2 2 - 3 3
I'm still using the bracket notation, so that, for example, in the 3-state PFSM, [a,b,c] means the set of symbols that take state 0 to state a, state 1 to state b, and state 2 to state c. The letter 'g' is in the set [1,2,0], and the set [1,2,1] is { f p }. A dash means that a symbol doesn't occur in a state.
In the 2-state PFSM, the symbols with transitions [0,0] are { a e i o q u }; in the 3-state PFSM the group [0,0,0] is { a e i o u }; and in the 4-state PFSM the group [0,0,0,0] is again { a e i o u }. (Somehow, the letter q sneaks into the vowel group in the 2-state PFSM.)
There is no group of consonants which is so clearly defined. In the 3-state PFSM the symbols with transitions [1,1,1] are { d s t y }, but in the 4-state PFSM, d is [3,2,-,3], s is [3,1,3,3], and t and y are [3,2,3,3].
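To make the bracket notation concrete, here is a minimal Python sketch that reads each symbol's column out of the 3-state English table above and groups symbols with identical columns. It is only an illustration of the notation, not the program that produced the tables; the name `bracket_groups` and the row-string layout are assumptions made for this example.

```python
from collections import defaultdict

SYMBOLS = "a b c d e f g h i j k l m n o p q r s t u v w x y z".split()

# Rows of the 3-state English table above, one string per from-state.
ROWS = [
    "0 2 1 1 0 1 1 2 0 2 2 2 2 0 0 1 2 1 1 1 0 2 1 0 1 2",
    "0 2 1 1 0 2 2 2 0 - 1 2 2 1 0 2 2 2 1 1 0 2 2 - 1 1",
    "0 2 2 1 0 1 0 2 0 2 1 1 2 2 0 1 2 2 1 1 0 2 2 - 1 0",
]

def bracket_groups(symbols, rows):
    """Group symbols by their column of next-states (hypothetical helper)."""
    table = [row.split() for row in rows]
    groups = defaultdict(list)
    for col, sym in enumerate(symbols):
        # The signature is the symbol's next-state from state 0, 1, 2, ...
        signature = tuple(row[col] for row in table)
        groups[signature].append(sym)
    return dict(groups)

groups = bracket_groups(SYMBOLS, ROWS)
print(groups[("1", "2", "0")])  # the group containing 'g'
print(groups[("1", "2", "1")])  # the group { f p }
print(groups[("0", "0", "0")])  # the vowel group
```

A dash is kept as its own value in the signature, so a symbol that never occurs in some state lands in a different group from one that does.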
For another example, here are the 2, 3, and 4-state PFSMs for Turkish.
Turkish, 2 States
3.69 bits/char
  | a b c d e f g h i j k l m n o ö p q r s t u ü v w x y z
--+---------------------------------------------------------
0 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1
1 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 - 1 1 1
Turkish, 3 States
3.52 bits/char
  | a b c d e f g h i j k l m n o ö p q r s t u ü v w x y z
--+---------------------------------------------------------
0 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 - 1 1 1 0 - 1 - 0 1 1
1 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 - 1 1 1
2 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 2 2 1 1 0 0 1 1 - 1 1
Turkish, 4 States
3.38 bits/char
  | a b c d e f g h i j k l m n o ö p q r s t u ü v w x y z
--+---------------------------------------------------------
0 | 0 1 3 1 0 3 3 3 0 1 3 3 3 3 0 - 3 - 3 3 3 3 - 3 - 3 3 3
1 | 0 3 2 1 0 - - 1 0 - 1 1 1 1 0 0 - - 3 3 1 0 0 - - 1 - 1
2 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 0 1 2 1 1 1 0 0 1 1 - 1 1
3 | 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 - 3 1 1 1 1 0 0 1 - - 1 1
So here's one possibility: if there is a set of symbols, defined by their state transitions, which is consistent in the small-state PFSMs, then those symbols might be the vowels of the language.
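One way to read "consistent in the small-state PFSMs" is: the symbols share a bracket group in every one of the machines. The sketch below applies that reading to the three English tables above. It is an assumption about how to mechanize the idea, not the procedure actually used here, and `signatures` and `persistent_groups` are names invented for the example.

```python
from collections import defaultdict

SYMBOLS = "a b c d e f g h i j k l m n o p q r s t u v w x y z".split()

# The English tables above: number of states -> one row string per from-state.
ENGLISH = {
    2: [
        "0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 1 1 0 1 1",
        "0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 - 1 1",
    ],
    3: [
        "0 2 1 1 0 1 1 2 0 2 2 2 2 0 0 1 2 1 1 1 0 2 1 0 1 2",
        "0 2 1 1 0 2 2 2 0 - 1 2 2 1 0 2 2 2 1 1 0 2 2 - 1 1",
        "0 2 2 1 0 1 0 2 0 2 1 1 2 2 0 1 2 2 1 1 0 2 2 - 1 0",
    ],
    4: [
        "0 1 3 3 0 3 3 3 0 2 3 0 3 0 0 3 2 3 3 3 0 2 3 1 3 2",
        "0 2 2 2 0 2 2 2 0 2 1 2 2 2 0 2 2 2 1 2 0 2 2 - 2 2",
        "0 - - - 0 2 - 2 0 - - 2 3 2 0 - - 2 3 3 0 - 2 - 3 3",
        "0 2 2 3 0 2 2 3 0 - 3 3 2 3 0 2 2 2 3 3 0 2 2 - 3 3",
    ],
}

def signatures(symbols, rows):
    """Map each symbol to its bracket signature in one machine."""
    table = [row.split() for row in rows]
    return {s: tuple(r[i] for r in table) for i, s in enumerate(symbols)}

def persistent_groups(symbols, machines):
    """Sets of two or more symbols that share a bracket group in every machine."""
    per_machine = [signatures(symbols, rows) for rows in machines.values()]
    groups = defaultdict(list)
    for s in symbols:
        # Two symbols get the same key only if they agree in all machines.
        key = tuple(sig[s] for sig in per_machine)
        groups[key].append(s)
    found = [g for g in groups.values() if len(g) > 1]
    return sorted(found, key=len, reverse=True)

groups = persistent_groups(SYMBOLS, ENGLISH)
print(groups[0])  # the largest set that survives in all three machines
```

On these tables the largest surviving set is { a e i o u }. Treating a dash as its own value is a strict choice; a looser test could let a dash match any state.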
How does this fit the VMs? Let's start with the 2, 3, and 4-state PFSMs for the FSG transcription of VMs Language B:
FSG Language B, 2 States
3.29 bits/char
  | A C D E F G H I K L M N O P R S T Y Z 2 4 6 7 8
--+-------------------------------------------------
0 | 1 0 0 0 0 1 0 1 1 1 1 - 1 0 0 0 0 - 0 0 0 1 - 0
1 | 1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 0 0 0 - 1 0 1 1 0
FSG Language B, 3 States
3.00 bits/char
  | A C D E F G H I K L M N O P R S T Y Z 2 4 6 7 8
--+-------------------------------------------------
0 | 2 0 0 0 0 1 0 2 1 2 1 - 2 0 0 0 0 - 0 0 0 1 - 0
1 | 2 0 0 1 0 1 0 2 1 2 1 1 2 0 1 0 0 - - 1 0 - 0 0
2 | 2 0 0 1 0 1 0 2 1 1 1 1 2 0 1 0 0 0 - 1 0 1 1 0
FSG Language B, 4 States
2.70 bits/char
  | A C D E F G H I K L M N O P R S T Y Z 2 4 6 7 8
--+-------------------------------------------------
0 | 2 0 0 3 0 1 0 2 - 2 3 - 2 0 3 0 0 - 0 3 3 3 - 0
1 | 2 0 0 1 1 2 0 2 3 2 - - 2 1 1 0 0 - 0 1 1 3 - 0
2 | 2 0 0 3 1 2 0 2 3 3 3 3 2 1 3 0 0 1 - 3 1 3 3 0
3 | 2 0 0 3 1 2 0 2 3 2 3 3 2 1 3 0 0 - - 3 1 - - 0
I'm not sure it's possible to identify a good candidate for a VMs vowel set, based on these PFSMs. There doesn't seem to be any coherent set that persists in all PFSMs.
The group [0,0] = { C D F H P S T 4 8 } in the 2-state PFSM is identical to the set [0,0,0] in the 3-state PFSM, but in the 4-state PFSM this set is broken up into the sets [0,0,0,0] = { C D H S T 8 }, [0,1,1,1] = { F P }, and [3,1,1,1] = { 4 }. Is this significant?
The set [1,1] = { A G I K L M O 6 } in the 2-state PFSM becomes the sets [1,1,1] = { G K M }, [2,2,2] = { A I O }, [2,2,1] = { L }, and [1,-,1] = { 6 } in the 3-state PFSM. The set { A I O } persists as [2,2,2,2] in the 4-state PFSM.
The group [0,1] = { E R 2 } in the 2-state PFSM is identical to the set [0,1,1] in the 3-state PFSM and the set [3,1,3,3] in the 4-state PFSM. Could this small set represent the vowels?
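The splits described in the last three paragraphs can be checked mechanically. Here is a small Python sketch that takes the 2-state and 3-state FSG tables above and reports, for each 2-state group, the 3-state groups its members fall into. The function names are made up for this example, and this is just one way of doing the bookkeeping.

```python
from collections import defaultdict

SYMBOLS = "A C D E F G H I K L M N O P R S T Y Z 2 4 6 7 8".split()

# FSG Language B tables above, one row string per from-state.
TWO_STATE = [
    "1 0 0 0 0 1 0 1 1 1 1 - 1 0 0 0 0 - 0 0 0 1 - 0",
    "1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 0 0 0 - 1 0 1 1 0",
]
THREE_STATE = [
    "2 0 0 0 0 1 0 2 1 2 1 - 2 0 0 0 0 - 0 0 0 1 - 0",
    "2 0 0 1 0 1 0 2 1 2 1 1 2 0 1 0 0 - - 1 0 - 0 0",
    "2 0 0 1 0 1 0 2 1 1 1 1 2 0 1 0 0 0 - 1 0 1 1 0",
]

def signatures(symbols, rows):
    """Map each symbol to its bracket signature in one machine."""
    table = [row.split() for row in rows]
    return {s: tuple(r[i] for r in table) for i, s in enumerate(symbols)}

def splits(symbols, coarse_rows, fine_rows):
    """For each group of the smaller machine, list the groups it splits into."""
    coarse = signatures(symbols, coarse_rows)
    fine = signatures(symbols, fine_rows)
    result = defaultdict(lambda: defaultdict(list))
    for s in symbols:
        result[coarse[s]][fine[s]].append(s)
    return {c: dict(f) for c, f in result.items()}

result = splits(SYMBOLS, TWO_STATE, THREE_STATE)
for coarse_sig, parts in result.items():
    print(coarse_sig, "->", parts)
```

A group that maps to a single 3-state signature has survived intact, as [0,0] and [0,1] do; a group that maps to several has been broken up, as [1,1] has.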
Other transcriptions don't seem to fare any better. Here are the 2, 3, and 4-state PFSMs for VMs Language B in the BHP transcription that I mentioned previously:
BHP Language B, 2 States
3.41 bits/char
  | a c ç d e é ê f F g i í î j k K l m n o p P q r s t T x y
--+-----------------------------------------------------------
0 | 1 0 0 0 0 0 0 0 0 1 1 1 1 - 0 0 0 1 1 1 0 0 0 0 0 0 0 - 1
1 | 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 0 0 1
BHP Language B, 3 States
3.09 bits/char
  | a c ç d e é ê f F g i í î j k K l m n o p P q r s t T x y
--+-----------------------------------------------------------
0 | 1 0 0 0 0 0 0 0 0 2 1 1 1 - 0 0 0 2 1 1 0 0 0 0 0 0 0 - 2
1 | 1 0 0 0 1 0 0 0 0 2 1 1 1 2 0 0 2 2 2 1 0 0 0 2 2 0 0 0 2
2 | 1 0 0 0 0 0 0 0 0 - 1 1 0 0 0 0 2 2 1 1 0 - 0 2 2 0 - - 2
BHP Language B, 4 States
2.81 bits/char
  | a c ç d e é ê f F g i í î j k K l m n o p P q r s t T x y
--+-----------------------------------------------------------
0 | 3 1 1 1 1 1 1 2 1 - 3 3 3 - 1 1 0 1 3 3 2 - 2 0 0 1 - - 0
1 | 3 1 1 1 1 1 1 1 1 0 3 3 3 - 1 1 0 - 3 3 1 1 0 0 0 1 1 - 0
2 | 3 1 1 1 3 1 1 2 1 0 3 3 3 - 1 1 2 0 2 3 2 1 2 1 1 1 1 - 3
3 | 3 1 1 1 3 1 1 2 1 0 3 3 3 0 1 1 0 0 0 3 2 1 2 0 0 1 1 2 0
The set [0,0] = { c ç d é ê f F k K p P q t T } in the 2-state PFSM is identical to the set [0,0,0] in the 3-state PFSM, except that P and T never occur in state 2 there. It is also similar to the set [1,1,1,1] = { c ç d é ê F k K t } in the 4-state PFSM, which omits { f p q }. Are the vowels in here?
The set [1,1] = { a g i í î m n o y } in the 2-state PFSM is reduced to [1,1,1] = { a i í o } in the 3-state PFSM, although this is similar to [3,3,3,3] = { a i í î o } in the 4-state PFSM. Maybe these are the vowels.
Anyway, I don't see it. Any or all of these sets could represent vowels, but none of them is as compelling as the vowel sets that are found in natural languages.
What would explain this? Here are some possibilities:
1) The VMs is a hoax.
2) The VMs has been encoded somehow. In this case, presumably decoding the text would make it possible to identify the vowels.
3) The VMs is written in a language which doesn't completely represent vowels, like Arabic or Sanskrit. I haven't looked at these kinds of languages, and don't know what their PFSMs would look like.
4) The text includes lots of abbreviations. Don't know how you'd analyze this using PFSMs.
5) The text represents numbers using letters, like Roman numerals, or the way classical Greek and Hebrew did. Still don't know how you'd analyze this.
6) None of the transcription alphabets yet accurately reflects the actual letters in the text. (Is "daiin" supposed to be d-a-i-i-n, or d-a-ii-n, or d-aii-n, or d-a-iin, or d-ai-in, or what?) If this were the problem, then we might be able to identify vowels once the correct transcription alphabet is identified. And in fact, this might be one test for what would qualify as a good transcription alphabet.
7) There are other odd peculiarities of the VMs alphabet which confuse the method. For example, the Latin text of Mosella that I've worked with doesn't use J and V, but instead uses the letters I and U as both vowels and consonants.
So, to answer the question whether the PFSMs for the VMs identify the vowels in the language: Maybe, I dunno.
In any case, it seems to me there's probably a lot of work that could be done to try to understand the structure of words in natural languages. If we understood that, we could try applying that knowledge to the VMs, and at the least tell whether the text represents any kind of spoken language.
-Ben