
The Nijmegen Arabic/Dutch Dictionary Project

1.4.8 Stripped Corpus

 

As discussed elsewhere, available frequency lists of Arabic are limited in size.

However, we needed such a list in order to expand the Arabic macro list after completing stage 1, the addition of Arabic translations to the Dutch reference file.

 

So we had to create our own frequency list from the Arabic corpus. Since the corpus is not tagged or lemmatized, simply extracting a frequency list from it with the concordance program was not sufficient: every prefixed or suffixed form of a word would be counted as a separate item.

 

First of all, the corpus had to be lemmatized in some way. Since doing this manually would be extremely time-consuming, we decided to perform a rough lemmatization by running a number of search-and-replace operations that separate frequent prefixes and suffixes from the words in the corpus. This resulted in a list of many thousands of words, which was then checked against the entries of the Arabic Lexical Corpus (ALC).
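The rough procedure described above can be illustrated with a minimal sketch in Python. It is not the project's actual set of search-and-replace operations, which is not listed here. The affix lists, the minimum stem length, and the function names `strip_affixes`, `stripped_frequencies` and `missing_from_lexicon` are all assumptions made for illustration, and the text is assumed to be unvocalized.

```python
import re
from collections import Counter

# Hypothetical affix lists, for illustration only. Longer affixes come first
# so that a combined prefix such as "wa-l-" is tried before plain "wa-".
PREFIXES = ["وال", "بال", "فال", "كال", "ال", "لل", "و", "ف", "ب", "ل"]
SUFFIXES = ["هما", "كما", "ها", "هم", "هن", "كم", "نا", "ون", "ين", "ات", "ة", "ه"]
MIN_STEM = 3  # assumed lower bound on the letters left after stripping


def strip_affixes(word):
    """Remove at most one listed prefix and one listed suffix from a word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


def stripped_frequencies(text):
    """Count stripped forms of all Arabic-letter tokens in the text."""
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    return Counter(strip_affixes(token) for token in tokens)


def missing_from_lexicon(freqs, lexicon):
    """Return (form, count) pairs not found in the lexicon, most frequent first."""
    return [(form, n) for form, n in freqs.most_common() if form not in lexicon]
```

Calling `missing_from_lexicon(stripped_frequencies(corpus_text), alc_entries)`, where `corpus_text` and `alc_entries` are placeholders for the corpus text and the set of existing entries, would give a candidate list to review by hand. Stripping of this kind is blind to morphology, so it will sometimes cut letters that belong to the word itself. The output is only a list of candidates, not a true lemmatization.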

 

It goes without saying that such a list could only serve as a means of adding words that were not yet present in the database at that time, without any guarantee of completeness.

 

So in order to check completeness, other steps had to be taken as well: comparing the database with other dictionaries and checking it while reading texts.

 

However, the use of the 'stripped corpus' did contribute, to a certain extent, to the completion of the Arabic Lexical Corpus (ALC). The number of words added as a result was rather limited, though: the table in the third link below contains only 286 words, and not all of these were added. Since creating the stripped corpus took a considerable amount of time, one might be tempted to consider the whole procedure rather ineffective.

 

Below are some examples of the stripped corpus, as well as a list of words that could be added to the database as a result of this procedure.

- example pages from the stripped text corpus (pdf)
- examples from frequency list from stripped corpus (pdf)
- list of words added as a result of this procedure (pdf)

reactions to: j.hoogland@let.kun.nl
last updated 26/10/2003 15:16 +0100