Re: Help me parse arabic text!
The version I'm using was found at
http://leb.net/qalam/islam/quran/
I'm very much interested in the HTML version you mentioned.
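On the ISO-8859-6 problem quoted below: in case it helps anyone whose tools cannot read that character set directly, one possible workaround is to decode the bytes to Unicode first and re-encode them in whatever form the tool accepts. The following is only a minimal sketch in Python. The helper name and the three sample bytes are made up for illustration and are not taken from the Quran file, and I have not checked what input MONKEY or TACT will actually accept.

```python
# Sketch: decode ISO-8859-6 bytes to Unicode, then re-encode as UTF-8.
# "iso8859_6" is the name of Python's built-in codec for this character set.

def iso8859_6_to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: convert ISO-8859-6 encoded bytes to UTF-8 bytes."""
    return raw.decode("iso8859_6").encode("utf-8")

# Made-up sample: the bytes for alef, lam, meem in ISO-8859-6.
raw = bytes([0xC7, 0xE4, 0xE5])

text = raw.decode("iso8859_6")
utf8 = iso8859_6_to_utf8(raw)

# Show the Unicode code point of each decoded letter.
codepoints = [f"U+{ord(ch):04X}" for ch in text]
print(codepoints)
```

Once the text is in Unicode, each letter can be mapped to an ASCII stand-in of one's own choosing before it is fed to an ASCII-only tool.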
Best Regards,
Jason
----------
Jason Morningstar
School of Information and Library Science
UNC Chapel Hill
On Sun, 25 Mar 2001, Rene Zandbergen wrote:
>
>
> Jason Morningstar wrote:
>
> > I'm interested in using Arabic as a language for
> > comparison with the VMS. However, I find myself out of my league and hope
> > someone can offer advice.
> >
> > I found a copy of the Koran in ISO-8859-6 encoding, but tools like MONKEY
> > and TACT choke on the character set.
>
> I'd be interested to know the URL. I once downloaded a text which was
> encoded in HTML, using the bold, italic and/or underline attributes
> to distinguish various characters. I can make the file available if
> anyone is interested. I enclose a few lines below as a sample. I
> removed some of the HTML trimmings to make it friendlier for e-mail.
>
> I did write a BITRANS table to convert it to something else, which I
> enclose also. This 'something else' was my first tentative way of
> rendering the Arabic alphabet in ASCII, and I am sure it is not very
> good. But it can probably serve as a template for someone who wants to
> do more with this.
>
> Cheers, Rene
>
>
> ----Sample-----
>
>
>
> SOORATU ALBAQARA<b>TI</b>
>
>
>
> Bismi All<u>a</u>hi a<b>l</b>rra<u>h</u>m<u>a</u>ni
>
> a<b>l</b>rra<u>h</u>eem<b>i</b>
>
>
> 1.Alif-l<u>a</u>m-meem
>
>
>
> 2.<u>Tha</u>lika alkit<u>a</u>bu l<u>a</u>
>
> rayba feehi hudan lilmuttaqeen<b>a</b>
>
>
> ----Bitrans table------
>
> (comment) Take HTML file with transliterated Arabic and convert
> (comment) First pass: clean up (+ upper to lower)
> Append
> 0 (zero)
> 1 (zero)
> 2 (zero)
> 3 (zero)
> 4 (zero)
> 5 (zero)
> 6 (zero)
> 7 (zero)
> 8 (zero)
> 9 (zero)
> ( (zero)
> ) (zero)
> AA c
> A a
> B b
> D d
> F f
> G g
> H h
> I i
> J j
> K k
> L l
> M m
> N n
> O o
> Q q
> R r
> S s
> T t
> U u
> W w
> Y y
> Z z
> - (zero)
> . (zero)
> (comment) remove silent characters except al-
> a<b>l</b> al
> <b> {
> </b> }
> (comment) End of this pass
> NewPass:
> (comment) Now follow the proper translation rules
> Append.
> < {
> > }
> a o
> <u>a</u> a,
> <u>at</u> a,t
> <u>ata</u> a,ta,
> <u>ath</u> a,d1
> b b4
> d d,
> <u>d</u> z1
> ee y
> gh c1
> h g
> <u>h</u> h
> <u>ha</u> ha,
> j h4
> kh h1
> oo w,
> o u
> r r,
> sh s3
> <u>s</u> z
> <u>sa</u> za,
> t b2
> <u>t</u> t
> <u>ta</u> ta,
> th b3
> <u>th</u> d1,
> <u>tha</u> d1,a,
> <i><u>th</u></i> t1
> w w,
> z r1,
> / (zero)
> NewPass:
> (comment) just remove the occurrences of word-final comma
> ,(space) (zero)(space)
>
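
For anyone without BITRANS, the multi-pass table quoted above could be approximated in Python roughly as follows. This is only a sketch under stated assumptions: it assumes each pass scans the text left to right, prefers the longest matching source string, and copies characters with no rule through unchanged, which may not be exactly how BITRANS behaves. Only a subset of the second-pass rules is included, and the function and variable names are made up for illustration.

```python
import re

def apply_pass(text, rules):
    """Apply one substitution pass.

    Assumed semantics: scan left to right, the longest source string wins,
    and characters that match no rule are copied through unchanged.
    """
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)

def transliterate(text, passes):
    """Run the passes in order, feeding each pass the previous pass's output."""
    for rules in passes:
        text = apply_pass(text, rules)
    return text

# Pass 1: delete digits and punctuation, fold upper case to lower case,
# and drop the silent-character markup around "al".
PASS1 = {ch: "" for ch in "0123456789()-."}
PASS1.update({ch: ch.lower() for ch in "ABDFGHIJKLMNOQRSTUWYZ"})
PASS1.update({"AA": "c", "a<b>l</b>": "al", "<b>": "{", "</b>": "}"})

# Pass 2: a SUBSET of the translation rules from the quoted table.
PASS2 = {
    "<": "{", ">": "}", "/": "",
    "a": "o", "<u>a</u>": "a,",
    "b": "b4", "ee": "y", "r": "r,",
}

# Pass 3: remove word-final commas.
PASS3 = {", ": " "}

PASSES = [PASS1, PASS2, PASS3]

# Inputs adapted from the quoted sample (line breaks joined with a space).
out1 = transliterate("1.Alif-l<u>a</u>m-meem", PASSES)
out2 = transliterate("l<u>a</u> rayba", PASSES)
print(out1)
print(out2)
```

Splitting the work into passes matters here because the first pass folds "<u>Tha</u>" to lower case before the second pass looks for "<u>tha</u>", and because the second pass maps "a" to "o" and "o" to "u" without the two rules feeding into each other.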