
The Nijmegen Arabic/Dutch Dictionary Project

1.13.1 The coverage of the Arabic vocabulary.

- introduction
- Coverage by another Arabic-Dutch dictionary
- Testing the comprehensiveness of our Arabic macro list
    - comprehensive test
    - reading test
- conclusion
- final remark

 


introduction

As mentioned elsewhere, the macro structure of our Arabic-Dutch dictionary contains 24,000 words. Since the prevailing view is that the Arabic vocabulary is very extensive, we might ask ourselves whether a dictionary containing 24,000 words will serve the user sufficiently when reading or listening to Arabic.

It is my impression that 24,000 words do guarantee good coverage of the vocabulary of texts written in Modern Standard Arabic. This impression is based on the fact that, while reading many hundreds of pages of Arabic texts during the course of the project, I did not encounter many words that were not already incorporated in the macro structure.

The main reason for reading those texts was to extract collocations and idiomatic expressions, but while doing so I also compiled a list of less frequent words (in some cases words whose meaning I did not know) in order to check whether these words were in our macro structure. This was just an impression formed during the compilation stage, but it was a reassuring one: I felt there was no risk of omitting great numbers of frequent words.


Coverage by another Arabic-Dutch dictionary

Another reassurance was the observation of my colleague Mark Van Mol, who published his learner's dictionaries of Arabic in 2001. In an article about his project, which was also published on the website about his dictionaries, he states the following:

This resulted into two learner's dictionaries. One Arabic-Dutch of 17,000 Arabic entries, and one Dutch-Arabic of ca. 20,000 entries. Samples of different texts point out that this learner's dictionary covers 99% of the vocabulary of any average text. This means that in spite of the limited macro, (the large dictionary of Hans Wehr, contains approximately 45,000 words), we cover almost the whole range of the actual vocabulary. It also means that a learner ought to be able to understand every modern Arabic text in using this dictionary.


Testing the comprehensiveness of our Arabic macro list

Since this observation was to serve as a reassurance for myself, I asked Mark Van Mol how he had tested this. In a private communication Van Mol informed me that while reading texts, with the macro structure of his Arabic-Dutch dictionary in mind, he observed that in every 100 words he would encounter one word that was not incorporated in his dictionary.


However, for the sake of absolute certainty about the comprehensiveness of our macro, as well as for the sake of overthrowing the prevailing view of the extensive Arabic vocabulary, I decided, after finishing the compilation stage, to perform some tests.

- The first test was a comprehensive test: I lemmatized a number of texts and produced a frequency list which could be compared with the macro structure of our database.

- The second test was similar to the test reported by Mark Van Mol: I read a number of text fragments, and during the reading process I registered all words of which I was not certain that they were included in our database.


comprehensive test

First of all, I saw an opportunity to make use of the 'stripped corpus' which has been described elsewhere. In this stripped corpus words have been lemmatized very roughly, and I decided to process a part of this corpus in order to extract real lemmatized data, and to compare these data with the macro list of our Arabic-Dutch dictionary.

This processing consisted of separating prefixes and suffixes, converting verbs into perfect stems etc. I have done this for a number of text files containing 167 KB of texts.

Through the following link an example of this lemmatized text can be viewed.

As for the number of words, it is not useful to count the words in these texts with a word processor, since many spaces have been added in order to separate the affixes. To obtain an impression of the number of words in these files, I counted the words in a normal text file of 141 KB, and the number of words was 24,764 (word count with MS Word). Since the lemmatized corpus contains extra spaces which contribute to the file size, I assume that comparing a non-lemmatized text of 141 KB with a lemmatized corpus of 167 KB is justified, and leads to the conclusion that the lemmatized corpus contains about 25,000 words.


From this lemmatized corpus I extracted a frequency list of all the words occurring in the corpus. This frequency list consisted of 3367 units, but some of these units were affixes and other characters or combinations of characters that cannot be considered words. The list of these units was compared (in an Access database) with a table containing the macro structure of our Arabic-Dutch dictionary. This comparison resulted in a list of 38 words that appeared to be absent from our database. This list is shown below.
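The comparison itself was carried out in an Access database. Purely as an illustration, the same two steps (building a frequency list and finding the units that are absent from the macro list) can be sketched in Python. The function names, the whitespace tokenization, the way affixes are filtered out and the tiny sample data are all assumptions made for this sketch; they do not describe the project's actual tooling.

```python
from collections import Counter

def frequency_list(lemmatized_text):
    """Count every whitespace-separated unit of an already lemmatized text."""
    return Counter(lemmatized_text.split())

def absent_from_macro(freq, macro_list, non_words=()):
    """Return the units that are neither in the macro list nor known non-words."""
    known = set(macro_list) | set(non_words)
    return sorted(unit for unit in freq if unit not in known)

# Hypothetical sample data: affixes are separated by spaces,
# as in the lemmatized corpus described above.
sample_text = "ال كتاب و ال قلم و وشيش"
sample_macro = ["كتاب", "قلم"]   # stand-in for the macro structure table
sample_affixes = ["ال", "و"]     # units that are not words in their own right

freq = frequency_list(sample_text)
absent = absent_from_macro(freq, sample_macro, sample_affixes)
print(len(freq), absent)
```

In this toy example the frequency list has five units, and only the last word of the sample text is reported as absent, because the other units are either in the sample macro list or are listed as affixes.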

 

| Word | Action | Comment |
|---|---|---|
| تصنيعي | added | nisbah adjective of تصنيع |
| أفظع | added IV + elative | comparative + verb |
| مرذول | added | participle, verb present in database |
| خليق | added | |
| مهادن | added | participle, verb present in database |
| وغل | added | |
| مترهل | added | participle, verb present in database |
| تشوف | added | |
| متاعب | added | plural, singular present in database |
| معصوب | added | participle, verb present in database |
| مقلص | added | participle, verb present in database |
| موضح | added | participle, verb present in database |
| أروع | added | comparative |
| استقلالية | added | |
| أصعب | added | comparative |
| أفخر | added | comparative |
| أنصح | added | |
| انصياع | added | masdar, verb present in database |
| بداوة | added | |
| ترؤس | added | masdar, verb present in database |
| تطبيب | added | masdar, verb present in database |
| ثانوية | added | |
| جرسون | added | |
| جنيفر | added | |
| سحائي | added | nisbah adjective of سحاءة |
| ضارة | added | |
| عضلي | added | nisbah adjective of عضل |
| فرادة | added | |
| متبع | added | participle, verb present in database |
| متمثل | added | participle, verb present in database |
| محجة | added | |
| مستمتع | added | participle, verb present in database |
| مستميت | added | participle, verb present in database |
| مضبب | added | participle, verb present in database |
| معمور | added | participle, verb present in database |
| منازلة | added | |
| منغم | added | participle, verb present in database |
| وشيش | added | |

 


In this list of 38 words there are:

- 13 participles of which the verbs were in the database

- 4 comparatives of which the adjectives were in the database

- 3 masdars of which the verbs were in the database

- 3 nisbah adjectives of which the base words were in the database

- 1 plural of which the singular was in the database.

So for 24 of these 38 words the (passive) dictionary user would already have found an indication of the meaning, and only in 14 cases would the user be left without an answer. I consider 14 words in a text of 25,000 words (the estimated size of the lemmatized corpus) to be a very low rate of missing words. Where Mark Van Mol, with a macro structure of 17,000 words, observed one missing word in every 100 words (1%), our dictionary, with its macro structure of 24,000 words, results in one missing word in about 1,786 words (25,000/14), i.e. 0.056%.
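The tally and the resulting rate can be checked with a few lines of Python. This is only a sketch of the arithmetic: the category labels and variable names are chosen here for illustration, and the corpus size of 25,000 words is the estimate discussed earlier, not a measured count.

```python
# Counts taken from the list above: words whose meaning a user could
# already derive from an entry that was present in the database.
derivable = {
    "participle": 13,
    "comparative": 4,
    "masdar": 3,
    "nisbah adjective": 3,
    "plural": 1,
}

absent_words = 38      # words absent from the macro structure
corpus_words = 25000   # estimated size of the lemmatized corpus (an estimate)

derivable_total = sum(derivable.values())
truly_missing = absent_words - derivable_total

rate_percent = 100 * truly_missing / corpus_words
words_per_miss = corpus_words / truly_missing

print(derivable_total, truly_missing)
print(f"{rate_percent:.3f}% missing, about one word in {words_per_miss:.0f}")
```

The same calculation with Van Mol's observation (1 missing word in 100) gives 1%, which is the figure used for the comparison.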


reading test

I performed another test in order to verify the comprehensiveness of our macro structure: I read a number of pages from two different literary texts.

Reading a fragment of 3800 words from a text found at the website of the Arab Writers Union resulted in the addition of 5 words to the database, as shown in the table below:

| Word | Comment |
|---|---|
| منقوع | Participle, verb already in the database |
| مضفور | Participle, verb already in the database |
| هسهسة | Masdar, already mentioned with the verb |
| جلافة | |
| فطام | Masdar, verb already in the database |

 

 


In this list of 5 words there are:

- 2 participles of which the verbs were in the database

- 2 masdars of which the verbs were in the database

So only one genuinely missing word was found in 3800 words, which is a rate of 0.026%.


The same procedure was followed with a number of pages from Nagib Mahfouz's novel Tharthara fawq an-nil (ثرثرة فوق النيل). A fragment of 8900 words was read in order to extract missing words. This led to the addition of 10 new words, as can be seen in the table below.

| Word | Comment |
|---|---|
| شامت | Participle, verb already in the database |
| نخامي | غدة نخامية as collocation |
| ندالة | |
| متهاتف | Participle, verb already in the database |
| افتتان | Masdar, verb already in the database |
| رؤوم | |
| معبّق | Participle, verb already in the database |
| مضمّخ | Participle, verb already in the database |
| جهش | جهش بالبكاء as expression |
| هزيع | |

 

 


In this list of 10 words there are:

- 4 participles of which the verbs were in the database

- 1 masdar of which the verb was in the database

Reading a text of 8900 words thus yielded 5 genuinely missing words (the other 5 additions being derivable from verbs already in the database), a rate of 0.056%, which equals the percentage found above in the comparison with the 25,000-word lemmatized corpus.


conclusion

With a macro structure containing 24,000 entries, a coverage of 99.5% of texts in Modern Standard Arabic is realised. Expanding this macro structure by reading texts would demand a considerable effort, since thousands of words have to be read in order to find new words. It is obvious that this process can only be done in an automated way. This, however, demands the availability of lemmatized and/or tagged text corpora. More about this can be read in section 1.6.6.
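To give a feeling for the effort involved, the rates observed in the three tests can be turned into a rough estimate of how much text would have to be read to find a given number of new words. The sketch below assumes that the rate of missing words stays constant as the macro structure grows; in practice new words would become rarer, so the figures are at best lower bounds. The function name and the target of 1000 new words are hypothetical and serve only as an illustration.

```python
import math

# (words read, genuinely missing words) for each test reported above.
tests = {
    "lemmatized corpus": (25000, 14),
    "Arab Writers Union fragment": (3800, 1),
    "Mahfouz fragment": (8900, 5),
}

def words_to_read(target_new_words, words_read, missing_found):
    """Estimate how many words must be read to find `target_new_words`,
    assuming the observed rate of missing words stays constant."""
    return math.ceil(target_new_words * words_read / missing_found)

for name, (words_read, missing_found) in tests.items():
    rate = 100 * missing_found / words_read
    needed = words_to_read(1000, words_read, missing_found)
    print(f"{name}: {rate:.3f}% missing, about {needed:,} words for 1000 new words")
```

Under this simplifying assumption, each of the three tests implies that well over a million words of running text would have to be read to find 1000 new entries.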


final remark

While reading the literary texts, for example the Mahfouz novel, I encountered a considerable number of words that were not found in the existing dictionaries. These are probably colloquial words, which may be found in dialect dictionaries like Badawi, or in the specialized dictionary of Vial (l'égyptien tel qu'on l'écrit).

A few examples of these words are:

شخذ، مستلق، تنابلة، أبراص، هاموش، هاكم

When I checked Vial, I found only the word تنابلة, in the expression تنابلة السلطان, which Vial translates as "paresseux du sultan" (lazybones of the sultan). This does not fit the context in the book (الهموم والتنابلة والإكلشيهات).

The other words could not be found in the list of Vial.

The text of this novel was scanned and processed with OCR software, so there is a chance that one or more of these words result from OCR errors. Since the original copy of the book was not available at the time of writing, I have not yet been able to verify this.

reactions to: j.hoogland@let.kun.nl
last updated 26/10/2003 15:16 +0100