
The Nijmegen Arabic/Dutch Dictionary Project

 

1.4 Working Methods

the first stage: translating the Dutch lexical corpus into Arabic

the second stage: completing the Arabic lexical corpus

conclusion

 

 

 

the first stage: translating the Dutch lexical corpus into Arabic

As described elsewhere on this site, a Dutch lexical corpus (DLC) was made available to us by our principal, the CLVV commission, together with the database editor OMBI. When the actual execution of the project started, the team of editors was instructed in the use of the editor and in the structure of the Dutch lexical corpus.

And then the work started: translating the Dutch words and expressions into Arabic.
The Dutch corpus was delivered to us in parts, starting with the nouns. The nouns starting with the letter D were available first, followed by the nouns with E and later K. The files were in SGML format, which can be imported into the database with OMBI. An example of such an SGML file can be viewed through the following link.

As mentioned elsewhere, the complete DLC was too large for our project, so we had to make a selection from the DLC (RBN). The total DLC contains about 46.000 entries, a number that has been reduced to 37.000 Dutch entries in the Dutch-Arabic part of our database. The selection was carried out by two persons, since we felt that a decision to retain or delete a word or expression should always be taken by two persons. We estimate that this selection process took 438 hours.

 

The total size of the DLC is:

46.603 entries (Form Units) (6.283 verbs, 29.495 nouns, 6.344 adjectives, 128 interjections, 927 adverbs, 426 other function words, 3.000 geographical names).
These 46.603 Form Units contain 71.000 meanings (Lexical Units, LUs), 89.000 examples/expressions (EUs) and 900 idioms (IUs).
 

As for the units that were deleted during the selection process, in general the following remarks can be made:

about 9.000 entries have been deleted.

about 27.000 meanings (lexical units) have been deleted

39.000 expressions have been deleted
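The rounded figures above can be checked against the counts reported for the Dutch part of the database at the start of the second stage (37429 Form Units, 44139 Lexical Units, 50740 examples). The following is only a minimal arithmetic sketch; the dictionary keys and the function name are chosen here for illustration, and the deletion figures are the approximations quoted in the text.

```python
# Rounded figures quoted in the text; the deletions are approximations.
dlc_total = {"entries": 46603, "lexical units": 71000, "examples": 89000}
deleted = {"entries": 9000, "lexical units": 27000, "examples": 39000}

# Dutch counts reported for the database at the start of the second stage.
database = {"entries": 37429, "lexical units": 44139, "examples": 50740}


def remaining(total, removed):
    """Subtract the deleted units from the original totals, per unit type."""
    return {unit: total[unit] - removed[unit] for unit in total}


estimate = remaining(dlc_total, deleted)
for unit, value in estimate.items():
    print(f"{unit}: about {value} expected, {database[unit]} in the database")
```

The estimates and the database counts differ by less than 2% for each unit type, which is what one would expect from rounded deletion figures.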

Some scanned pages, on which the units that were to be deleted are highlighted, can be viewed through the following links.
        page 1        page 2

 

During the selection process we noticed that part of the DLC needed some editing as well. So during the process of deleting we also entered changes, consisting of shortening the sentences of expressions, or, in some cases, making expressions more explicit in order to facilitate the translation into Arabic.

And even in the final stage of proofreading it was decided that some editorial changes had to be made to the Dutch definitions of the meanings of polysemous words.

 

The translation of the DLC into Arabic.
The most time-consuming task of the whole project has without doubt been the translation of all the Dutch words and expressions.
The editors spent thousands of hours on finding translations for these.

As mentioned elsewhere, most of the existing Dutch-Arabic dictionaries do not contain many examples. If we had only had to translate the 37.000 headwords, without the expressions, this would have saved us a disproportionate amount of time, since it was mainly the translation of the examples that demanded time-consuming tools and discussions. Using a concordance program and consulting various other dictionaries can take many minutes for just one translation.

As shown elsewhere, a significantly higher percentage of expressions than of words (lexical units) had to be translated with explanatory descriptions.

Whereas the addition of a bidirectional translation was at the same time an investment in the creation of the Arabic lexical corpus (ALC) and the Arabic-Dutch part, the creation of such a huge number of unidirectional descriptions had no effect on the development of the Arabic-Dutch part.

The tools we have used in translating the Dutch words and expressions, some of which have already been mentioned, were:

  1. lexical knowledge of the editors

  2. consultation of and discussions with other editors

  3. other dictionaries

  4. Dutch-Arabic database with rough materials

  5. text corpus with concordance program

I will discuss these tools and show some examples.

 

1 Lexical knowledge of the editors
It goes without saying that the editors who participated in the execution of the project were selected for their excellent command of both languages involved, i.e. Dutch and Arabic. In a different section of this website I present some information on the editors.

However, despite the excellent lexical knowledge of the editors, many words and expressions could not simply be translated without consulting reference works or other sources of information.
Not only was it difficult to find translations, in many cases we also consulted the reference works in order to obtain confirmation, since adding a translation to a dictionary carries more responsibility than just writing a translation in a letter or a leaflet.

Other factors contributing to the difficulties in translating the Dutch words and expressions are:

the fact that many Dutch words simply do not have a translation equivalent in Arabic. More on this topic can be read in the section about descriptions.

the fact that the Dutch lexical corpus contains spoken language, a language variety which can only be translated in Modern Standard Arabic with great difficulty. Read about this in section 1.6.4.

the fact that Modern Standard Arabic is nobody's native language; it is learned and taught formally at school. Read about this in section 1.6.3.

the fact that the Arabic lexicon is very large, but only a relatively small part of it is actively mastered by speakers of Arabic.

So, although we were convinced that we had gathered a team of excellent editors, in many cases they had to consult reference works and tools in order to be certain about proposed Arabic translations.

 

2 Consultation of and discussions with other editors
Obviously the presence of a number of editors (with different linguistic backgrounds) is a major asset when it comes to finding solutions for difficult translation problems, and many hours have been spent discussing possible translations. Discussions could concentrate on finding an accurate and concise description (explanation), on the status of a translation (an example or a description), on the currency of an expression (is this correct Arabic?), etc. Needless to say, the variety of the editors' countries of origin contributed to the discussions as well. As can be seen in the section on the editors, they came from Morocco, Egypt, Syria and Iraq.
 

 

 

3 Other dictionaries
It is obvious that any dictionary editor benefits from the hard work of his colleagues and predecessors. In some cases this 'borrowing' from the work of colleagues becomes very obvious when one compares translations from different dictionaries. In the section of this website on descriptions I discuss some interesting examples of borrowing that I have found.

 

We have been using a great number of dictionaries during all stages of the compilation of our dictionaries. These dictionaries can be classified in different categories:

monolingual Arabic dictionaries

bilingual general dictionaries with Arabic as target language

bilingual general dictionaries with Arabic as source language

bilingual specialized dictionaries with Arabic as target language

bilingual specialized dictionaries with Arabic as source language

A full overview of the dictionaries that were used can be seen through the following link.

One reference dictionary should be mentioned here, since we have used it very intensively during all stages of the project: the Larousse/ALECSO Basic Dictionary (al-mu'jam al-'asasi المعجم الأساسي). This monolingual Arabic dictionary, compiled especially for non-Arabophone learners of Arabic, has proven to be a reliable and practical reference work. It will not be very difficult to trace some borrowing from this dictionary in our dictionaries, especially in the field of expressions. I was encouraged by Dr. Ali Al-Kasimi, ALECSO coordinator for the Basic Dictionary Project, to borrow from his work, and so we did. So, now that our project is finished, I want to thank Dr. Al-Kasimi for this permission, as well as for his answers to a number of questions he allowed me to ask him about some 'difficult' words to be translated into Arabic.

Another conclusion after using so many different dictionaries of Arabic concerns the so-called specialized dictionaries. The list referred to above mentions a considerable number of these dictionaries. In particular, the Unified Dictionaries of the Bureau of Coordination of Arabization of ALECSO in Rabat account for 75% of all the specialized dictionaries.

In many cases these specialized dictionaries do not agree with each other on the translation of specialized terminology from different fields. Furthermore, the terminology presented in these specialized dictionaries can very often not be verified in actual usage, since it cannot be found even in a large text corpus.

Additional research on this topic could lead to interesting conclusions.

The other existing Dutch-Arabic dictionaries were hardly used by the editors of the present project. The Van Mol Learner's Dictionaries did not come out until 2001, when we had already finished the stage of translating the DLC into Arabic. As for the two other existing dictionaries (Amiens and Al-Manhal), we considered them insufficiently reliable, so although they were available we did not consult them very often.

 

4 Dutch-Arabic database with rough materials

During a period of some years preceding the actual execution of the project I started to collect and merge a number of vocabulary lists, for example from various teaching materials I was preparing for my students. These materials consisted of the complete vocabulary list of the Arabic Textbook for beginners, written by Krahl and Reuschl, but also more specialized materials like word lists with newspaper articles or radio broadcasts, and texts in a course on 'Business Arabic'. All these lists were merged in an Access database which, when we started the execution of the project, consisted of 55.000 records.
A small example of this table is shown through this link.

This table contained a lot of useful 'rough material', but it also contained a lot of redundant information, since basic vocabulary could be entered from different lists.

This Access database did not only consist of a table with Dutch words and their translations, it also contained a table with collocations and one with idiomatic expressions. Not only had I been collecting word lists in the pre-execution stage, I had also been collecting collocations and idiomatic expressions during all the reading of and listening to Arabic texts. These were texts I had been using for my courses, as well as texts that were presented to me for translation into Dutch. The content of these tables proved useful in the stage of completing the Arabic lexical corpus (ALC), which will be discussed later.

 

5 Text corpus with concordance program
The execution of the project started in 1997, but the preparations had started several years before. An important part of the preparatory work had been the compilation of an Arabic text corpus. This process was started in 1994 with the scanning and OCR processing of a number of novels and non-fiction articles from magazines. At that time, not much Arabic text was available on the internet. Most Arabic text on the internet was in the form of images in those days, a format not suitable for electronic processing. So the rather time-consuming method of scanning seemed useful then. Furthermore, we obtained a complete volume of the London-based newspaper Al Hayat, since this newspaper was the first to offer its archives in electronic format.

When we started the execution in March 1997 we had at our disposal a text corpus of 3 million words. More texts were available, since the complete Al Hayat volume alone contains more than 3 million words, but in order to meet the requirement of a balanced corpus, I decided not to include all the Hayat articles in the corpus.

A list of all the scanned novels and non-fiction texts is available, and can be viewed through this link.
During the years of the execution of the project, more and more Arabic text became available through the internet. The press agency Quds Press was probably the first news agency to make its news releases available on the internet. The site of the Arab Writers Union in Damascus (http://www.awu-dam.org/) made it possible to extend the volume of literary texts in the corpus with over 24 MB of text. Various newspapers started publishing daily issues on the internet, among them Al Ahram and many others. Al Hayat is to this date available in PDF format, but using the search option on their website results in text that can be copied and saved in text format.

During the stage of translating the DLC we felt that we lacked information on sports terminology, but through the various newspaper websites it was relatively easy to collect a number of texts on sports. I read a number of these pages and stored interesting terminology, collocations and idiomatic expressions in the Access database described above.

A similar exercise has been carried out with files from the Islamic News Agency.

When even Arabic search engines (like Ayna: www.ayna.com) became available, the whole internet became like a gigantic corpus of Arabic texts. However, the reliability of the internet as a source of linguistic information is of course doubtful, just as the reliability of all the information on the internet is sometimes doubtful.

To illustrate the technological progress that took place during the stage of execution of the project, it is worth mentioning that by the end of the project the ALECSO Unified Dictionaries even became available on-line. They can be searched in three languages (Arabic, English and French) and can be reached through the following link: http://www.arabization.org.ma/Dictionnaire.asp

A text corpus consisting of several million words can only be exploited with a concordance program. For this purpose we have used the program Monoconc, produced by Athelstan (http://www.athel.com/). There is a separate section about the way we have been using this concordance program during the project.
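To illustrate what a concordance program does, here is a minimal keyword-in-context (KWIC) sketch. It is not the program used in the project and makes simplifying assumptions: the text is split on whitespace, only exact matches of the keyword are found, and the function name and sample sentence are invented for the illustration. The same token-based approach applies to Arabic text.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every exact hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text; a real corpus would be read from files.
sample = "the editors read the corpus and the editors discussed the results"
lines = kwic(sample.split(), "editors", width=2)
for left, keyword, right in lines:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20} [{keyword}] {right}")
```

Seeing every occurrence of a word with its neighbours side by side is what makes it practical to spot collocations and fixed expressions in a corpus of millions of words.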

 

the second stage: completing the Arabic lexical corpus
When the first stage, the translation of the Dutch lexical corpus (DLC), had been finished, we had created the core of the Arabic lexical corpus (ALC). However, this ALC still had to be extended.

It is not possible to trace the exact size of the ALC at the beginning of the second stage, but the numbers for the content of the database at that time indicate the following:
 

Language/Unit                 Dutch     Arabic
Form Unit                     37429     18429
Lexical Unit                  44139     22688
Example                       50740     32966
Description                   3454      23032
Total meanings (LU+EX+Descr)  98333     78686

 So there was an imbalance between the two parts of the dictionary which had to be corrected. Explanations for the existence of this imbalance have been presented elsewhere on this site.

Another explanation for this imbalance lies in the existence of lexical gaps in Arabic in relation to Dutch. This phenomenon is described in a separate section of this website.

The ALC had to be extended on different levels. A paragraph about this can be read in section 1.2.3.

So, first of all we needed to add Arabic words, since a vocabulary of only 18.429 words is not sufficient for a translation dictionary. Secondly, the number of examples, expressions and collocations had to be extended, as well as the number of idiomatic expressions.

Different steps were taken in order to correct the just mentioned imbalance.

  1. many texts from various sources have been read to extract expressions, collocations and missing Arabic words.

  2. texts from specific domains have been read in order to extract additional Arabic words and expressions.

  3. other dictionaries were used as reference lists

  4. a list called ar-Raseed al-lugawiy was used

  5. frequency lists were used

  6. a frequency list was made with the concordance program and the corpus of texts

  7. a list of roots from the dictionary of Hans Wehr was used as reference

  8. use of the memory of native speakers in order to extend the number of expressions and collocations

I will describe these different steps briefly.

 

1 Texts from various sources have been read to extract expressions, collocations and missing Arabic words.
In order to extract words in their context or combinations of words  (i.e. collocations or expressions) I have read many texts in electronic form, i.e. on the screen of the computer. Interesting words or combinations of words could be copied to separate files, and then be imported as txt-files into an Access database. After importing the words, collocations and expressions in separate tables, I processed these data by adding

Dutch translations,

the head words under which the combinations or expressions should be stored

roots to the Arabic head words in order to make alphabetical ordering in Arabic possible.

The data from these tables were used by the editors of the dictionary in order to extend the number of Arabic entries, lexical units and expressions.
It is worth mentioning that all data from these tables that were included in the dictionary had to be retyped, since the data from the text corpus are always without vowels, whereas the data in the dictionary database are entered with diacritics.

Examples of these database tables can be seen through the following link.
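The difference between unvocalized corpus text and vocalized dictionary entries mentioned above can be made concrete with a small sketch. It is not part of the project's workflow (the editors retyped the data); it only shows, under the assumption that vowel signs are encoded as Unicode combining marks, how a vocalized form can be reduced to its unvocalized spelling so that the two can be compared. The function name is invented for the illustration.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks (fatha, damma, kasra, shadda, sukun, ...)."""
    # The string is deliberately not decomposed first, so precomposed
    # letters such as alef with hamza keep their hamza.
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# kaf + fatha, ta + fatha, ba + fatha: the vocalized form of k-t-b.
vocalized = "\u0643\u064e\u062a\u064e\u0628\u064e"
# The same word as it would appear in an unvocalized corpus text.
unvocalized = "\u0643\u062a\u0628"

print(strip_diacritics(vocalized) == unvocalized)
```

The reverse direction is the hard one: an unvocalized corpus form does not determine its vowels, which is why the entries had to be typed again with diacritics.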


2 Texts from specific domains have been read in order to extract additional Arabic words and expressions.
As already mentioned in the paragraph about the first stage, texts from specific domains like sports, Islam, health etc. were read with the specific intention of extracting missing words and expressions from them.
Since the source of all data in the database with rough materials was always entered, it is possible to show the data extracted from these texts. Not all of these data have been entered into the dictionary database, but a substantial percentage of them have.
Through the following link, some examples of these data can be viewed.

 

 

3 Other dictionaries were used as reference lists
When the project of our Belgian colleague Mark van Mol reached its final stage, we obtained a list of the macro structure of his Arabic-Dutch part. So it was possible to make a comparison between both lists. This comparison resulted in a list of about 4000  words that were in the list of Van Mol, but not in our dictionary. After a thorough examination of this list, the number of words that could be considered omissions in our list was around 250. There were differences in spelling of foreign words that had to be left out of the list, and we concluded that Van Mol had included a considerable number of masdars and participles that did not have a specific meaning. It has been our policy not to include such masdars and participles, but given the fact that the Van Mol dictionaries are learner's dictionaries, his decision to include such words seems justified.

Another dictionary that has been used as a reference is the already mentioned Larousse/ALECSO Basic Dictionary (المعجم الأساسي).

And finally, another Larousse dictionary was used to check the macro structure of our dictionary: the Arabic-French As-Sabil dictionary by Daniel Reig. Since the macro structure of this dictionary contains 25.000 entries, it was considered a good reference list.
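Comparing the macro structure of a reference dictionary with our own, as described above for the Van Mol list, amounts to taking the difference between two sets of headwords. The sketch below is a minimal illustration with invented sample words and an invented function name; it is not the procedure actually used. As the text makes clear, the resulting candidates still need manual examination (spelling variants, masdars and participles).

```python
def missing_entries(reference, ours):
    """Headwords that occur in the reference list but not in our list."""
    return sorted(set(reference) - set(ours))


# Invented sample lists; the real lists contained thousands of headwords.
reference_list = ["كتاب", "قلم", "مدرسة", "كتابة"]
our_list = ["كتاب", "مدرسة"]

candidates = missing_entries(reference_list, our_list)
print(len(candidates), "candidate words to examine by hand")
```

An exact string comparison like this reports every spelling difference as a missing word, which is one reason why a list of about 4000 candidates could shrink to around 250 real omissions after examination.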

 

4 A list titled ar-Raseed al-lugawiy was used
We have also used a list prepared by ALECSO (الرصيد اللغوي العربي، لتلاميذ الصفوف الستة الاولى من مرحلة التعليم الأساسي) containing words from different fields and at various levels, corresponding to the different classes of primary education in the Arab countries. The exact size of this list is unknown, since it is not mentioned in the introduction of the book. I estimate that the list contains over 15.000 words.
Several hundreds of words were added to the ALC as a result of the use of this list.

 

5 Frequency lists were used
We have used existing frequency lists of the Arabic language, but this did not result in many additions, since existing frequency lists cover limited numbers of words.
More information on these frequency lists is presented on a separate page.

 

6 A frequency list was made with the concordance program and the corpus of texts
As described elsewhere, we have created a very rough frequency list by processing the corpus with a number of search-and-replace operations that separate frequent prefixes and suffixes from the words in the corpus. Frequent errors resulting from such replacements were subsequently corrected. This resulted in a list of many thousands of words, which was checked against the entries of the ALC.

Although this is a very rough method, it has contributed to the completion of the ALC.
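The idea of such a rough frequency list can be sketched as follows. This is not the actual series of search-and-replace operations used in the project: the lists of prefixes and suffixes, the minimum length and the function names are all assumptions made for the illustration.

```python
from collections import Counter

# Assumed examples of frequent prefixes and suffixes; the lists actually
# used in the project are not given in the text.
PREFIXES = ["وال", "بال", "ال", "و"]
SUFFIXES = ["ها", "هم", "ات"]


def rough_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def frequency_list(text):
    """Count the roughly stemmed forms of all whitespace-separated words."""
    return Counter(rough_stem(word) for word in text.split())


# Invented sample text.
sample = "الكتاب والكتاب كتاب بالكتاب القلم"
freq = frequency_list(sample)
for word, count in freq.most_common():
    print(word, count)
```

A mechanical approach like this also strips letters that belong to the word itself (a word such as وزير would lose its first letter), which corresponds to the kind of frequent errors that had to be corrected afterwards.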

 

7 A list of roots from the dictionary of Hans Wehr was used as reference
A list of all roots from the Hans Wehr dictionary can be found on the website of Tim Buckwalter. I have compared this list with a list of roots from our dictionary, and was thus able to add some words that were not in our dictionary. Since many of these words seemed rather unusual, for practical reasons I applied an additional criterion to check the usefulness of a word: whether the word in question was included in the Al Qamoos electronic dictionary by Sakhr.

However, this list of roots from Hans Wehr also appeared to contain a rather large number of roots that could not be found in Hans Wehr, neither in the 1980 Arabic-English edition, nor in the 1985 Arabic-German edition. 

 

8 Use of the memory of native speakers in order to extend the number of expressions and collocations
Native speakers from the team of editors were asked to enlarge lists of collocations and expressions from memory. This resulted in some additions, but it has been our experience this was not the most effective way of expanding the ALC. It is my assumption that the fact that no single person speaks MSA as his real mother tongue might be the cause of this lack of ability to expand the list by introspection.

So by the end of the project, after going through all the steps described above, we obtained an Arabic lexical corpus that seems rather complete. During the last months before the actual completion some colleagues have tested the content of the database with questions concerning words or combinations not found in existing dictionaries, and, fortunately, in almost all cases our dictionary in its pre-final form was able to answer their questions.

The following table shows the numbers of units in both languages after the first stage (same numbers as in the table at the beginning of this paragraph), and after the second stage, including a number indicating the increase in terms of percentage.

 

Language/Unit                 Dutch, end of Stage 1   Dutch, end of Stage 2   Increase in %   Arabic, end of Stage 1   Arabic, end of Stage 2   Increase in %
Form Unit                     37429                   37704                   0.7%            18429                    24236                    31%
Lexical Unit                  44139                   44019                   -0.3%           22688                    31050                    36%
Example                       50740                   50205                   -1.1%           32966                    43486                    31%
Description                   3454                    14351                   315%            23032                    29509                    28%
Total meanings (LU+EX+Descr)  98333                   108575                  10%             78686                    104045                   32%

 

From this table it becomes clear that the ALC has increased considerably (by over 30%), whereas the DLC hardly changed: the number of Form Units rose slightly, while the numbers of Lexical Units and examples decreased slightly.
The strongly increased number of Ar-Du descriptions (from 3454 to 14351, i.e. a 315% increase) I do not consider an increase of the DLC, since these Dutch descriptions are unidirectional translations which only appear in the Arabic-Dutch part. This strong increase can be explained by the fact that in the second stage many Arabic words were entered into the database for which no Dutch equivalent existed. Obviously these words had not been entered during the first stage, when only translations from Dutch into Arabic were entered into the database.
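The percentages in the table can be recomputed from the counts with a few lines of code. This is a minimal sketch; the data structure and function names are chosen here for illustration. Recomputed this way, a few of the Arabic figures come out slightly higher than printed (for example about 36.9% for Lexical Units against 36% in the table), which suggests that those values were rounded down.

```python
# (count at end of Stage 1, count at end of Stage 2), taken from the table.
counts = {
    "Dutch": {
        "Form Unit": (37429, 37704),
        "Lexical Unit": (44139, 44019),
        "Example": (50740, 50205),
        "Description": (3454, 14351),
    },
    "Arabic": {
        "Form Unit": (18429, 24236),
        "Lexical Unit": (22688, 31050),
        "Example": (32966, 43486),
        "Description": (23032, 29509),
    },
}


def pct_increase(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


def total_meanings(units):
    """Sum LU + EX + Descr (Form Units are not counted as meanings)."""
    parts = [pair for unit, pair in units.items() if unit != "Form Unit"]
    return sum(b for b, _ in parts), sum(a for _, a in parts)


for language, units in counts.items():
    for unit, (before, after) in units.items():
        print(f"{language:7} {unit:13} {pct_increase(before, after):7.1f}%")
    before, after = total_meanings(units)
    print(f"{language:7} {'Total':13} {pct_increase(before, after):7.1f}%")
```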

 

 

Conclusion
I take the view that the Nijmegen Dictionaries of Arabic have been compiled using innovative methods in the field of Arabic lexicography. These innovative methods have resulted in a relatively high number of collocations (see a separate section of this website on collocations) in comparison to existing dictionaries of Arabic.

reactions to: j.hoogland@let.kun.nl
last updated 04/11/2004 09:48 +0100