1.13.1 The coverage of the Arabic vocabulary.
As mentioned elsewhere, the macro structure of our Arabic-Dutch dictionary contains 24.000 words. Since the prevailing view is that the Arabic vocabulary is very extensive, we might ask ourselves if a dictionary containing 24.000 words will serve the user sufficiently when reading or listening to Arabic.
It is my impression that 24.000 words do guarantee a good coverage of the vocabulary of texts written in Modern Standard Arabic. This impression is based on the fact that while I was reading many hundreds of pages of Arabic texts during the course of the project I did not encounter many words that were not incorporated in the macro structure already.
The main reason for reading those texts was to extract collocations and idiomatic expressions, but while doing so I also compiled a list of less frequent words (in some cases words whose meaning I did not know) in order to check whether these words were in our macro structure. This was only an impression formed during the compilation stage, but it was a reassuring one: I felt there was no risk of omitting great numbers of frequent words.
Another Arabic-Dutch dictionary
Another reassurance was the observation of my colleague Mark Van Mol, who published his learner's dictionaries of Arabic in 2001. In an article about his project, which was also published on his website about his dictionaries, he states the following:
This resulted into two learner's dictionaries. One Arabic-Dutch of 17,000 Arabic entries, and one Dutch-Arabic of ca. 20,000 entries. Samples of different texts point out that this learner's dictionary covers 99% of the vocabulary of any average text. This means that in spite of the limited macro, (the large dictionary of Hans Wehr, contains approximately 45,000 words), we cover almost the whole range of the actual vocabulary. It also means that a learner ought to be able to understand every modern Arabic text in using this dictionary.
Testing the comprehensiveness of our Arabic macro list
Since this observation was to serve as a reassurance for myself, I asked Mark Van Mol how he had tested this. In a private communication Van Mol informed me that while reading texts, with the macro structure of his Arabic-Dutch dictionary in mind, he observed that in every 100 words he would encounter one word that was not incorporated in his dictionary.
However, for the sake of absolute certainty about the comprehensiveness of our macro, as well as for the sake of overthrowing the prevailing view of the extensive Arabic vocabulary, I decided, after finishing the compilation stage, to perform some tests.
First of all, I saw an opportunity to make use of the 'stripped corpus' which has been described elsewhere. In this stripped corpus words have been lemmatized very roughly, and I decided to process a part of this corpus in order to extract real lemmatized data, and to compare these data with the macro list of our Arabic-Dutch dictionary.
This processing consisted of separating prefixes and suffixes, converting verbs into perfect stems etc. I have done this for a number of text files containing 167 KB of texts.
Through the following link an example of this lemmatized text can be viewed.
As for the number of words, it is not useful to count the words in these texts with a word processor, since many spaces have been added in order to separate the affixes. To obtain an impression of the number of words within these files I counted the words in a normal text file of 141 KB, and the number of words was 24.764 (word count with MS Word). Since the lemmatized corpus contains extra spaces which contribute to the file size, I assume that comparing a non-lemmatized text of 141 KB with a lemmatized corpus of 167 KB is justified, and that it leads to the conclusion that the lemmatized corpus contains about 25.000 words.
From this lemmatized corpus I extracted a frequency list of all the words occurring in the corpus. This frequency list consisted of 3367 units, but some of these units were affixes and other characters or combinations of characters that cannot be considered words. The list of these units was compared (in an Access database) with a table containing the macro structure of our Arabic-Dutch dictionary. This comparison resulted in a list of 38 words that appeared to be absent from our database. This list is shown below.
In this list of 38 words there are:
- 13 participles of which the verbs were in the database
- 4 comparatives of which the adjectives were in the database
- 3 masdars of which the verbs were in the database
- 3 nisbah adjectives of which the base words were in the database
- 1 plural of which the singular was in the database.
So for 24 of these 38 words the (passive) dictionary user would already have found an indication of the meaning, and only in 14 cases would the user be left without an answer. I consider 14 words on the basis of a text of 25.000 words (the estimated size of the lemmatized corpus) to be a very low rate of missing words. Where Mark Van Mol, with a macro structure of 17.000 words, observed one missing word in every 100 words (1%), our dictionary with its macro structure of 24.000 words results in one missing word in about 1786 words (25.000/14), i.e. 0.056%.
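The comparison described above was carried out in an Access database. Purely as an illustration, the same idea (build a frequency list, discard affixes, and keep the units that are absent from the macro list) can be sketched in Python. The function names and the transliterated toy data are hypothetical placeholders, not material from the project.

```python
from collections import Counter

def frequency_list(tokens):
    """Count how often each unit occurs in the lemmatized corpus."""
    return Counter(tokens)

def missing_units(freq, macro, affixes=frozenset()):
    """Return units that are neither affixes nor in the macro list,
    ordered by descending frequency (ties broken alphabetically)."""
    candidates = [u for u in freq if u not in macro and u not in affixes]
    return sorted(candidates, key=lambda u: (-freq[u], u))

# Toy data in rough transliteration, invented for illustration only.
tokens = "al kitab al jadid kataba al walad kitab gharib".split()
macro = {"kitab", "jadid", "kataba", "walad"}   # hypothetical macro list
affixes = {"al"}                                # separated affixes are not words

freq = frequency_list(tokens)
missing = missing_units(freq, macro, affixes)
print(missing)
```

In a real run the remaining list would still need manual inspection, as in the classification above, to separate derivable forms (participles, masdars, comparatives) from words that are really missing.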
I performed another test in order to verify the comprehensiveness of our macro structure: I read a number of pages from two different literary texts.
Reading a fragment of 3800 words from a text found at the website of the Arab Writers Union resulted in the addition of 5 words to the database, as shown in the table below:
In this list of 5 words there are:
- 2 participles of which the verbs were in the database
- 2 masdars of which the verbs were in the database
So one really missing word out of 3800 was added, which results in a percentage of 0.026%.
The same procedure was followed with a number of pages from Nagib Mahfouz's novel Tharthara fawq an-nil (ثرثرة فوق النيل). A fragment of 8900 words was read in order to extract missing words. This has led to the addition of 10 new words, as can be seen in the table below.
In this list of 10 words there are:
- 4 participles of which the verbs were in the database
- 1 masdar of which the verb was in the database
So of the 10 words added, 5 were really missing from a text of 8900 words, which is a percentage of 0.056%. This equals the percentage found above in the comparison with the lemmatized corpus of 25.000 words.
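The three percentages can be checked with a few lines of Python. This is only a sketch of the arithmetic; the counts are the figures reported above, while the function name and the labels are chosen here for illustration.

```python
def missing_rate(missing_words, text_words):
    """Percentage of running words for which the dictionary gives no answer."""
    return 100.0 * missing_words / text_words

# (really missing words, estimated number of running words), as reported above
tests = {
    "lemmatized corpus": (14, 25_000),
    "Arab Writers Union fragment": (1, 3_800),
    "Mahfouz fragment": (5, 8_900),
}

rates = {name: missing_rate(m, n) for name, (m, n) in tests.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}% missing")
```

Note that the corpus size of 25.000 words is itself an estimate, so the first percentage is approximate.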
With a macro structure containing 24.000 entries a coverage of 99.5% of texts in Modern Standard Arabic is realised. Expanding this macro structure further by reading texts would demand a considerable effort, since thousands of words have to be read in order to find a single new word. It is obvious that this process can only be done in an automated way. This, however, demands the availability of lemmatized and/or tagged text corpora. More about this can be read in section 1.6.6.
When I was reading the literary texts, for example the Mahfouz novel, I encountered a considerable number of words that were not found in the existing dictionaries. These are probably colloquial words, which may be found in dialect dictionaries like Badawi, or in the specialized dictionary of Vial (l'égyptien tel qu'on l'écrit).
A few examples of these words are:
شخذ، مستلق، تنابلة، أبراص، هاموش، هاكم
When I checked Vial I only found the word تنابلة, in the expression تنابلة السلطان, which in Vial is translated as "paresseux du sultan" (lazybones of the sultan). This does not fit the context in the book (الهموم والتنابلة والإكلشيهات).
The other words could not be found in the list of Vial.
The text of this novel was scanned and processed with OCR software, so there is a chance that one or more of these words result from OCR errors. Since the original copy of the book was not available at the time of writing, I have not yet been able to verify this.