Re: Help me parse arabic text!
Jason Morningstar wrote:
> I'm interested in using arabic as a language for
> comparison with the VMS. However, I find myself out of my league and hope
> someone can offer advice.
>
> I found a copy of the koran in ISO-8859-6 encoding, but tools like MONKEY
> and TACT choke on the character set.
I'd be interested to know the URL. I once downloaded a text that was
encoded in HTML, using the bold, italic and/or underline attributes
to distinguish various characters. I can make the file available if
anyone is interested. I enclose a few lines below as a sample; I
removed some of the HTML trimmings to make it friendlier for e-mail.
I also wrote a BITRANS table to convert it to something else, which I
enclose as well. This 'something else' was my first tentative way of
rendering the Arabic alphabet in ASCII, and I am sure it is not very
good. But it can probably serve as a template for someone who wants
to do more with this.
Cheers, Rene
----Sample-----
SOORATU ALBAQARA<b>TI</b>
Bismi All<u>a</u>hi a<b>l</b>rra<u>h</u>m<u>a</u>ni
a<b>l</b>rra<u>h</u>eem<b>i</b>
1.Alif-l<u>a</u>m-meem
2.<u>Tha</u>lika alkit<u>a</u>bu l<u>a</u>
rayba feehi hudan lilmuttaqeen<b>a</b>
----Bitrans table------
(comment) Take HTML file with transliterated Arabic and convert
(comment) First pass: clean up (+ upper to lower)
Append
0 (zero)
1 (zero)
2 (zero)
3 (zero)
4 (zero)
5 (zero)
6 (zero)
7 (zero)
8 (zero)
9 (zero)
( (zero)
) (zero)
AA c
A a
B b
D d
F f
G g
H h
I i
J j
K k
L l
M m
N n
O o
Q q
R r
S s
T t
U u
W w
Y y
Z z
- (zero)
. (zero)
(comment) remove silent characters except al-
a<b>l</b> al
<b> {
</b> }
(comment) End of this pass
NewPass:
(comment) Now follow the proper translation rules
Append.
< {
> }
a o
<u>a</u> a,
<u>at</u> a,t
<u>ata</u> a,ta,
<u>ath</u> a,d1
b b4
d d,
<u>d</u> z1
ee y
gh c1
h g
<u>h</u> h
<u>ha</u> ha,
j h4
kh h1
oo w,
o u
r r,
sh s3
<u>s</u> z
<u>sa</u> za,
t b2
<u>t</u> t
<u>ta</u> ta,
th b3
<u>th</u> d1,
<u>tha</u> d1,a,
<i><u>th</u></i> t1
w w,
z r1,
/ (zero)
NewPass:
(comment) just remove the occurrences of word-final comma
,(space) (zero)(space)
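
For anyone without BITRANS to hand, the multi-pass idea in the table can
be sketched in Python. This is only a rough illustration, not a
reimplementation of BITRANS: it assumes each pass scans the text left to
right, applies the longest matching rule at each position, and does not
rescan its own output. The names apply_pass, transliterate, PASS1, PASS2
and PASS3 are made up for this sketch, and the dictionaries hold only a
handful of the rules from the table.

```python
def apply_pass(text, rules):
    """Replace the longest matching key at each position, scanning left to right."""
    keys = sorted(rules, key=len, reverse=True)  # try longer keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def transliterate(text, passes):
    """Run the passes in order, feeding each pass the previous pass's output."""
    for rules in passes:
        text = apply_pass(text, rules)
    return text


# A small, hand-picked subset of the table (assumption: "(zero)" means "delete").
PASS1 = {"1": "", "2": "", ".": "", "-": "", "A": "a", "L": "l",
         "a<b>l</b>": "al", "<b>": "{", "</b>": "}"}
PASS2 = {"<": "{", ">": "}", "a": "o", "<u>a</u>": "a,",
         "b": "b4", "ee": "y", "r": "r,", "t": "b2"}
PASS3 = {", ": " "}  # drop a comma that is followed by a space

PASSES = [PASS1, PASS2, PASS3]

print(transliterate("2.l<u>a</u> rayba", PASSES))  # prints: la r,oyb4o
```

Trying longer keys first is what lets a rule such as <u>a</u> take
priority over the single-character rules for < and a.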