
The Nijmegen Arabic/Dutch Dictionary Project

1.4.6 Using the concordancy program


A text corpus consisting of several million words can only be exploited with a concordancy program. For this purpose we used the program Monoconc, produced by Athelstan (http://www.athel.com/).


Other lexicographers have used other tools (Perl, WordCruncher), but we found Monoconc sufficient for our needs. For Arabic text the program has one shortcoming: the search hits of an Arabic key word in context (KWIC) are presented on the screen in the wrong order. Reading from right to left (as Arabic should be read), we first read the context that follows the key word, then the key word itself in the middle, and finally the context that precedes it. This seemed a serious shortcoming at first, but we got used to it very quickly. The following illustration shows the search screen of Monoconc with a number of search hits.

 


 

Because of the limitations and characteristics of the Arabic text corpus and of the Arabic language, a number of problems had to be overcome while using Monoconc. Because Arabic is written without vowels, the corpus contains a great deal of ambiguity, which cannot be disregarded during the concordance process. Because Arabic uses prefixes, infixes and suffixes on a very large scale, searching is difficult. It is of course possible to use wildcards in a search string, but these very often return too many unwanted occurrences. Weak roots in particular (with hamzah, waw or ya as second or third radical) make searching difficult. Monoconc offers a batch search, i.e. a search for a number of strings in one search action, but this requires entering each search string, and it takes substantially longer, since the whole corpus is searched once for each search string. Another option is searching with regular expressions, but we did not use this option very often.
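The cost of scanning the corpus once per search string can in principle be avoided by combining the strings into one alternation, so that each line is scanned only once. The following Python sketch illustrates the idea. It is not how Monoconc works internally; the function name `batch_search`, the line-based corpus layout and the transliterated sample forms are assumptions made for the illustration.

```python
import re
from collections import defaultdict


def batch_search(lines, search_strings):
    """Find several literal strings in one pass over the corpus lines.

    Returns a dict mapping each string that was found to a list of
    (line_number, column) pairs. Line numbers start at 1, columns at 0.
    """
    if not search_strings:
        return {}
    # Longest alternatives first, so that a longer string is preferred
    # over a shorter one that happens to be its prefix.
    alternatives = sorted(set(search_strings), key=len, reverse=True)
    rx = re.compile("|".join(re.escape(s) for s in alternatives))
    hits = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # finditer reports non-overlapping matches only.
        for match in rx.finditer(line):
            hits[match.group(0)].append((line_no, match.start()))
    return dict(hits)
```

A weak verb could be covered by listing its surface forms, for example `batch_search(lines, ["qaala", "yaquulu", "qiila"])`, where the transliterated forms stand in for Arabic script. Because the matches are non-overlapping, a string that only occurs inside a longer listed string is not reported separately at that position.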


The concordancy program has been used for various types of searches:

- does a word occur in the corpus?

- does a combination of words (collocation, expression) occur in the corpus?

- with what words can a certain word co-occur (collocate)?

- what is the most frequent spelling of a word?

- what forms of a certain root do occur in the corpus?

- to produce a frequency list of the corpus


I will give some additional information on these different search types.

 

Does a word occur in the corpus?

It seems obvious that a text corpus of a certain size is a reliable source of information with regard to the actual usage of a word, i.e. whether a word is used in written language. One should of course bear in mind that very specialized terminology may be absent from the corpus and yet be in use in specialized language. But for 'general vocabulary' the corpus can be relied on to decide whether a word is in use or not.


Does a combination of words (collocation, expression) occur in the corpus?

Since the concordancy program can search for combinations of words (with a number of words in between if necessary), the corpus was a reliable source of information with regard to collocations. Since collocations are often defined as frequent combinations, such combinations should be present in the corpus. This will be demonstrated in the section on collocation.
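A search for two words with a limited number of words in between can be expressed as a regular expression. The following Python sketch is an illustration under assumptions, not Monoconc's query syntax: the function name `collocation_regex`, the `max_gap` parameter and the transliterated sample sentence are invented for the example, and words are taken to be separated by whitespace.

```python
import re


def collocation_regex(first, second, max_gap=2):
    """Build a regex matching `first`, then at most `max_gap`
    intervening words, then `second`. Both words must match as
    whole whitespace-delimited tokens."""
    if max_gap < 0:
        raise ValueError("max_gap must be non-negative")
    # Zero to max_gap intervening tokens, each followed by whitespace.
    gap = rf"(?:\S+\s+){{0,{max_gap}}}"
    return re.compile(
        r"(?<!\S)" + re.escape(first) + r"\s+" + gap
        + re.escape(second) + r"(?!\S)"
    )


# Transliterated sample sentence, made up for the illustration.
sample = "ittakhadha alwaziir qaraaran haamman"
match = collocation_regex("ittakhadha", "qaraaran", max_gap=1).search(sample)
```

Because the two words are matched as whole tokens, forms carrying a prefix or suffix are not found by this sketch; they would have to be listed or handled with wildcards.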


With what words can a certain word co-occur (collocate)?

This type of search proved useful in translating Dutch compounds. Compounding exists on a large scale in Dutch, whereas it does not exist in Arabic. Dutch compounds are therefore mostly translated into Arabic with a combination of two nouns or of a noun and an adjective. The following example illustrates this type of search. The English word 'nuclear' in combination with various nouns (weapon, war, power plant etc.) is translated into Dutch with a number of compound words whose first segment is the Dutch word 'kern'. Our Dutch-Arabic part contains the words 'kernafval, kernbewapening, kernbom, kerncentrale, kernenergie, kernexplosie, kernfysica, kernfysicus, kernkop, kernlading, kernmacht, kernonderzeeër, kernoorlog, kernproef, kernraket, kernramp, kernreactor, kernstop, kernwapen, kernwapenverdrag, kernwapenvrij' (nuclear waste, ~ armament, ~ bomb, ~ power plant, ~ energy, ~ explosion, ~ physics, ~ physicist, ~ warhead, ~ charge, ~ power, ~ submarine, ~ war, ~ test, ~ missile, ~ disaster, ~ reactor, ~ freeze, ~ weapon, ~ arms treaty, ~-free).

It is obvious that a concordance of the Arabic word for nuclear, i.e. nawawiy (نوويّ), would supply us with a number of translations for these compounds.

Part of this concordancy file can be viewed through the following links:

- pdf-file containing some pages of the concordance-file

- txt-file containing all matches (1.1 MB)

- txt-file containing first pages (41 KB)

The pdf-file of the page containing these Dutch compounds starting with 'kern-' (nuclear) and their translations can be viewed through this link.
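The search described in this section amounts to collecting the words that stand next to a key word. Since an Arabic adjective follows its noun, the noun of interest is the word immediately before the key word. The following is a minimal Python sketch, assuming a corpus that has already been split into tokens; the function name `left_collocates`, the `prefix_match` option and the transliterated sample tokens are made up for the illustration.

```python
from collections import Counter


def left_collocates(tokens, keyword, prefix_match=True):
    """Count the words that immediately precede the key word.

    With prefix_match=True, any token that starts with `keyword`
    counts as a hit, which also catches suffixed forms.
    """
    counts = Counter()
    for previous, current in zip(tokens, tokens[1:]):
        if prefix_match:
            hit = current.startswith(keyword)
        else:
            hit = current == keyword
        if hit:
            counts[previous] += 1
    return counts


# Transliterated sample tokens, made up for the illustration.
sample = "silaaH nawawiy wa Harb nawawiyya wa silaaH nawawiy".split()
collocates = left_collocates(sample, "nawawiy")
```

Prefix matching lets the feminine form `nawawiyya` count as a hit in the sample. Forms that carry a prefix, such as the definite article, would still be missed by this sketch.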


What is the most frequent spelling of a word?

Words of foreign origin can be spelled in Arabic in different ways: vowels can be rendered as short or long, and consonants as emphatic or plain. This shows, for example, in the different spellings of geographical names. With the corpus and the concordancy program it was easy to identify which spelling was most frequent and consequently had to be included in the dictionary.
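Comparing spellings comes down to counting how often each candidate occurs in the corpus. The following is a minimal Python sketch, assuming a tokenized corpus and a hand-made list of candidate spellings; the function name `spelling_counts` and the transliterated variants are hypothetical and are not taken from the project's data.

```python
from collections import Counter


def spelling_counts(tokens, variants):
    """Return (variant, count) pairs, most frequent first.

    Variants with equal counts keep the order in which they were given.
    """
    wanted = list(dict.fromkeys(variants))  # drop duplicates, keep order
    lookup = set(wanted)
    counts = Counter(t for t in tokens if t in lookup)
    pairs = [(v, counts[v]) for v in wanted]
    # sorted() is stable, so ties stay in input order.
    return sorted(pairs, key=lambda pair: -pair[1])


# Transliterated sample tokens, made up for the illustration.
sample = "zaara huulandaa thumma huulaandaa wa huulandaa".split()
ranking = spelling_counts(sample, ["huulandaa", "huulaandaa", "hulanda"])
```

The first pair in the result is the candidate for inclusion in the dictionary. A variant with a count of zero is kept in the list, so that an unattested spelling remains visible.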


What forms of a certain root do occur in the corpus?

In some cases we wanted to know which forms of a certain root occur in the corpus. With roots consisting of three strong radicals it proved possible to find all occurring forms with the following search pattern: %%%%%%%R1%%R2%%R3*. R1, R2 and R3 represent three different radicals (fa, 'ayn and lam in terms of fa'ala patterns), % is a wildcard substituting one or zero characters, and * is a wildcard substituting any string.

However, the percentage of roots that can be treated in this way is limited, since it is not only the weak radicals that complicate the situation, but also radicals that can act as independent morphemes (initial lam can be the preposition li, initial fa can be the conjunction fa etc.). Buckwalter argues that only 37% of all roots can be processed without complications [Tim Buckwalter in an unpublished paper presented at the 1994 MESA conference].
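The wildcard pattern %%%%%%%R1%%R2%%R3* can be approximated with a regular expression. The sketch below is a minimal Python illustration, not Monoconc's own syntax or implementation. It assumes that `%` corresponds to an optional single character and `*` to any string, and that the pattern is matched against whole word tokens. The helper names `wildcard_to_regex` and `forms_of_root` are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regex:
    '%' becomes an optional single character, '*' any string,
    and every other character is taken literally."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".?")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def forms_of_root(tokens, r1, r2, r3):
    """Return the tokens that fit %%%%%%%R1%%R2%%R3* as a whole word."""
    rx = wildcard_to_regex(f"%%%%%%%{r1}%%{r2}%%{r3}*")
    return [t for t in tokens if rx.fullmatch(t)]


# Transliterated sample tokens standing in for Arabic script.
sample = ["kataba", "kaatib", "maktuub", "qalam"]
found = forms_of_root(sample, "k", "t", "b")
```

For the root k-t-b the call keeps `kataba`, `kaatib` and `maktuub` and drops `qalam`. The sketch also shows the limitation described above: any token that contains the three letters in this order within the allowed distances is accepted, whether or not it derives from the root.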


To produce a frequency list of the corpus

This use of the corpus and the concordancy program will be treated in the section in which the second stage is described.


In conclusion: in spite of the problems and shortcomings mentioned above, we could not have finished our dictionaries without the help of the concordancy program and the text corpus.

reactions to: j.hoogland@let.kun.nl
last updated 26/10/2003 15:16 +0100