The Nijmegen Arabic/Dutch Dictionary Project

1.6.5 The lack of reliable frequency lists of MSA

During the execution of the project I consulted several frequency lists of Modern Standard Arabic (MSA). First of all, frequency lists are the dictionary maker's assurance that important (frequent) words are not left out of the dictionary.

The second reason for consulting frequency lists is that they can serve as a guideline for extending the macrostructure of the dictionary.
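The first use mentioned above, checking that frequent words have not been left out, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `missing_headwords`, the input format (a list of words ordered by frequency) and the transliterated toy data are all invented for the example and do not come from any of the lists discussed here.

```python
def missing_headwords(frequency_list, headwords, top_n):
    """Return the top_n most frequent words that are absent from the dictionary.

    frequency_list: words ordered from most to least frequent (assumed format).
    headwords: collection of entries already present in the dictionary.
    """
    known = set(headwords)
    return [word for word in frequency_list[:top_n] if word not in known]


# Toy, invented data in transliteration; not taken from a real frequency list.
frequency_list = ["fi", "min", "kitab", "bayt", "madrasa"]
headwords = ["fi", "min", "bayt"]

gaps = missing_headwords(frequency_list, headwords, top_n=4)
print(gaps)  # ['kitab']
```

Raising `top_n` step by step would correspond to the second use: the words reported as missing are candidates for extending the macrostructure.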

The following frequency lists were available and were consulted more or less intensively:

- Fromm: Häufigkeitswörterbuch der modernen arabischen Zeitungssprache, Arabisch-Deutsch-Englisch
- Kouloughli, D.E.: Lexique fondamental de l'arabe standard moderne/Basic Lexicon of Modern Standard Arabic
- Sabuni, Abdulghafur: Wörterbuch des arabischen Grundwortschatzes
- qa'ima makka, Umm al Qura University, Mecca

I have added some PDF scans of pages from these frequency lists.

Kouloughli    page 1     page 2     page 3

Sabuni          page 1     page 2     page 3     page 4     page 5

Fromm          page 1     page 2     page 3     page 4     page 5

qa'ima makka    page 1    page 2    page 3    page 4

 

However, all these frequency lists cover only the first few thousand most frequent Arabic words. It seems rather obvious that additional frequency data for Arabic are needed, especially for the so-called middle segment of the Arabic lexicon: in addition to the 3,000 to 4,000 most frequent words presented in the frequency lists just mentioned, we are very much in need of lists of 10,000, 15,000, 20,000 and 25,000 words. As concluded elsewhere on this site, 24,000 words give a coverage of over 99% of all texts in MSA.
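To make the notion of coverage concrete: it is taken here to mean the share of all running words (tokens) in a corpus that is accounted for by the N most frequent words. The sketch below shows how such a figure could be computed from a ranked list of counts. The function names and the counts are invented for illustration; they are not real MSA data and do not reproduce the 99% figure quoted above.

```python
def coverage(counts_by_rank, n):
    """Share of all corpus tokens accounted for by the n most frequent words.

    counts_by_rank: token counts sorted from most to least frequent word.
    """
    total = sum(counts_by_rank)
    if total == 0:
        return 0.0
    return sum(counts_by_rank[:n]) / total


def words_needed(counts_by_rank, target):
    """Smallest number of top-ranked words whose coverage reaches target, or None."""
    total = sum(counts_by_rank)
    running = 0
    for rank, count in enumerate(counts_by_rank, start=1):
        running += count
        if running >= target * total:
            return rank
    return None


# Invented counts for illustration only; not real MSA data.
counts = [50, 25, 15, 5, 3, 2]

print(coverage(counts, 2))         # 0.75
print(words_needed(counts, 0.8))   # 3
```

With real data, `words_needed` applied to targets such as 0.95 or 0.99 would indicate how long a frequency list for the middle segment would have to be.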

The missing frequency data would first of all be of use to dictionary makers. A second application would be the development of course books and other teaching materials, for both native and non-native speakers of Arabic.

However, a reliable frequency list can only be produced on the basis of a large tagged corpus, and a large tagged corpus can only be obtained through a computerized tagging system or lemmatizer. More on this topic can be read in section 1.6.6.
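Why a tagged corpus matters can be illustrated with a small sketch: once every surface form has been linked to its lemma, counting lemmas instead of surface forms is straightforward. The input format used here, a sequence of (surface form, lemma) pairs, is an assumption made for the example and not a description of any existing tagger or lemmatizer; the transliterated tokens are invented.

```python
from collections import Counter


def lemma_frequency_list(tagged_tokens):
    """Rank lemmas by how often they occur in a tagged corpus.

    tagged_tokens: iterable of (surface_form, lemma) pairs; this pair format
    is an assumption for the example, not the output of a real tagger.
    """
    counts = Counter(lemma for _surface, lemma in tagged_tokens)
    return counts.most_common()


# Invented toy corpus in transliteration: several surface forms share one lemma.
tagged = [
    ("al-kitab", "kitab"),
    ("kutub", "kitab"),
    ("wa-kitabuhu", "kitab"),
    ("al-bayt", "bayt"),
    ("buyut", "bayt"),
    ("fi", "fi"),
]

ranking = lemma_frequency_list(tagged)
print(ranking)  # [('kitab', 3), ('bayt', 2), ('fi', 1)]
```

Without the lemma column, the five different surface forms of the two nouns would each be counted once, which is why an untagged corpus cannot yield a reliable list of this kind.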

reactions to: j.hoogland@let.kun.nl
last updated 26/10/2003 16:45 +0100