The Nijmegen Arabic/Dutch Dictionary Project

1.6.6 The absence of a voweled text corpus

As described in section 1.4.6, the concordance program, used in combination with the text corpus, was an essential aid in carrying out the project.

However, the fact that Arabic is normally written without (short) vowels is a major impediment. It makes searching the text corpus less accurate: because the vowels are absent from both the corpus text and the search string, the results are 'polluted' with unwanted hits, namely every other word that shares the consonants of the word sought.
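
The effect can be sketched in a few lines of Python. This is only an illustration, not the concordance program used in the project: the five-word "corpus", the glosses and the helper names strip_vowels, search_unvoweled and search_voweled are assumptions made for the sketch. It removes the vowel signs (Unicode U+064B to U+0652) and shows that a search on the consonants ayn-lam-mim returns every voweled reading that shares them, whereas a search in a voweled corpus could single out one reading.

```python
import re

# Arabic vowel signs (tanwin, fatha, damma, kasra, shadda, sukun)
# occupy the Unicode range U+064B..U+0652.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_vowels(text: str) -> str:
    """Return the text as it is normally written, without vowel signs."""
    return HARAKAT.sub("", text)


# Hypothetical voweled mini-corpus: (word, gloss). Unicode escapes are used
# so that the vowel signs are unambiguous in the source code.
VOWELED_CORPUS = [
    ("\u0639\u0650\u0644\u0652\u0645", "knowledge"),              # 'ilm
    ("\u0639\u064E\u0644\u064E\u0645", "flag"),                   # 'alam
    ("\u0639\u064E\u0644\u0650\u0645\u064E", "he knew"),          # 'alima
    ("\u0639\u064E\u0644\u0651\u064E\u0645\u064E", "he taught"),  # 'allama
    ("\u0642\u064E\u0644\u064E\u0645", "pen"),                    # qalam
]


def search_unvoweled(query: str, corpus):
    """Match on the consonants only, as happens in an unvoweled corpus."""
    skeleton = strip_vowels(query)
    return [(w, g) for w, g in corpus if strip_vowels(w) == skeleton]


def search_voweled(query: str, corpus):
    """Exact match, which is only possible when the corpus is voweled."""
    return [(w, g) for w, g in corpus if w == query]


if __name__ == "__main__":
    # Unvoweled query ayn-lam-mim: four readings come back, three unwanted.
    hits = search_unvoweled("\u0639\u0644\u0645", VOWELED_CORPUS)
    print([g for _, g in hits])  # ['knowledge', 'flag', 'he knew', 'he taught']

    # Voweled query 'ilm against the voweled corpus: one hit.
    hits = search_voweled("\u0639\u0650\u0644\u0652\u0645", VOWELED_CORPUS)
    print([g for _, g in hits])  # ['knowledge']
```

In a real corpus the unwanted hits have to be weeded out by hand, which is exactly the loss of accuracy described above.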

So a voweled corpus would be very useful, and not only for lexicographers. One step further would be a tagged corpus, i.e. a corpus annotated with codes that indicate part of speech, tense and/or aspect of verbs, etc.
However, such a tagged corpus is still a desideratum in 2003. The Sakhr company is known to own a text corpus that has been tagged, but it seems reluctant to share it with the academic community.
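
As a rough illustration of what such annotation would make possible, the following Python sketch stores each token with its unvoweled form, its voweled form and a part-of-speech label. The record layout, the tag names (PREP, PRON, VERB), the three sample tokens and the function find are all hypothetical; they do not describe the Sakhr corpus or any other existing resource.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    surface: str  # unvoweled form, as it appears in print
    voweled: str  # fully voweled form
    pos: str      # part-of-speech label (hypothetical tag set)
    gloss: str    # English gloss, for the reader of this sketch


MIM_NUN = "\u0645\u0646"  # the letters mim + nun, unvoweled

# Hypothetical tagged mini-corpus: three tokens that look identical in print.
TAGGED_CORPUS = [
    Token(MIM_NUN, "\u0645\u0650\u0646\u0652", "PREP", "from"),         # min
    Token(MIM_NUN, "\u0645\u064E\u0646\u0652", "PRON", "who"),          # man
    Token(MIM_NUN, "\u0645\u064E\u0646\u0651\u064E", "VERB",
          "he bestowed a favour"),                                      # manna
]


def find(corpus: List[Token], surface: str,
         pos: Optional[str] = None) -> List[Token]:
    """Return the tokens with the given unvoweled form, optionally
    restricted to one part-of-speech label."""
    return [t for t in corpus
            if t.surface == surface and (pos is None or t.pos == pos)]


if __name__ == "__main__":
    # Without the tag, all three tokens are returned.
    print(len(find(TAGGED_CORPUS, MIM_NUN)))  # 3
    # With the tag, only the verb is returned.
    print([t.gloss for t in find(TAGGED_CORPUS, MIM_NUN, pos="VERB")])
```

With the tags in place, a lexicographer looking for the verb could exclude the preposition and the pronoun at query time instead of filtering them out by hand.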

It is my contention that unvoweled written Arabic lacks so much essential information that disambiguation will remain very difficult.

One of the members of the DINAR project demonstrated to me that the combination of mim and nun in Arabic (من) can be analyzed in 16 different ways (among them different forms of the verb manna).

Several initiatives are under way in this field, such as the Morphological Analyzer by Tim Buckwalter at the Linguistic Data Consortium (LDC) and the Arabic Treebank project in Prague.

last updated 26/10/2003 13:29 +0100
