Malay Rumi-Jawi Converter

Technical Details of the Converter

Algorithmic Details

The Rumi–Jawi converter is at its core a dictionary-based method, meaning that each Rumi or Jawi word is looked up in the dictionary and, if it exists, mapped to one or more forms in the other script. The results are concatenated and shown in the output box. The full process involves a few more steps, which are detailed below.

Step 1: Tokenization

A computer does not know what a word is, so, from the perspective of the converter, the input box only contains a sequence of characters. Tokenization is the process of finding word boundaries (or token boundaries, more generally) in this character sequence. Unlike, say, Chinese or Japanese, Malay is written with spaces between words in both Rumi and Jawi, which makes the task considerably easier. There are still some challenges, however, such as punctuation. If words were split only at spaces, the converter would treat "makan," (including the comma) as a word and fail to find an entry in the dictionary, whereas it would have found "makan" (without the comma).

In the converter, sequences of letter characters (Latin or Arabic) are grouped into tokens that get converted, while whitespace and most punctuation are ignored. Hyphens (-), however, are used frequently in both Rumi and Jawi for reduplication, and they are therefore included in word tokens. Commas, semicolons, and question marks appear differently in Rumi and Jawi and are therefore handled specially.
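The grouping described above can be sketched with a regular expression. This is a minimal illustration, not the converter's actual code: the pattern, the name TOKEN_RE, and the function tokenize are all assumptions made for this example. It treats a run of Unicode letters, optionally joined by hyphens, as one word token and passes everything else through untouched.

```python
import re

# Hypothetical pattern: one or more Unicode letters (Latin or Arabic),
# optionally followed by hyphen-joined letter runs such as "kupu-kupu".
# [^\W\d_] means "a word character that is not a digit or underscore".
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Split text into (is_word, chunk) pairs that concatenate back to text."""
    pieces = []
    pos = 0
    for match in TOKEN_RE.finditer(text):
        if match.start() > pos:
            # Whitespace and punctuation between words are kept as-is.
            pieces.append((False, text[pos:match.start()]))
        pieces.append((True, match.group()))
        pos = match.end()
    if pos < len(text):
        pieces.append((False, text[pos:]))
    return pieces
```

Keeping the non-word chunks in the result means the output can be rebuilt by joining the pieces in order, with only the word tokens replaced.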

Step 2: Normalization

Once words have been found by tokenization, they are passed to a conversion function depending on the direction of conversion (Rumi-to-Jawi or Jawi-to-Rumi), and the first step of conversion is normalization. This step reduces variation that would make conversion difficult.

In Rumi-to-Jawi conversion, the only word normalization that is done is downcasing. Computers see A and a as different letters, so if the dictionary contained the word kuning, a lookup for Kuning would fail without downcasing.

In Jawi-to-Rumi conversion there is no downcasing as Jawi does not have upper and lower case, but there are other reasons for variation. In Jawi, the preferred letter for /k/ sounds is ک rather than the Arabic kaf ك, but some documents nevertheless use the latter, so these are normalized to the former. Similarly, Jawi uses ݢ for the /g/ sound, but sometimes people use گ or ڬ instead, so these latter two get normalized to the first one.

For conversion in both directions, commas, semicolons, and question marks are replaced with the appropriate version.
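The normalization rules above can be sketched as simple character translation tables. This is an illustration under stated assumptions: the function and table names are invented for this example, the code points are one reading of the letters named in the text, and the use of the Arabic comma, semicolon, and question mark as the Jawi punctuation forms is assumed rather than stated by the converter.

```python
# Letter variants folded to the preferred Jawi forms (assumed code points).
JAWI_LETTER_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic kaf ك -> ک
    "\u06AF": "\u0762",  # گ -> ݢ
    "\u06AC": "\u0762",  # ڬ -> ݢ
})

# Punctuation pairs (assumption: the Jawi side uses Arabic punctuation).
RUMI_TO_JAWI_PUNCT = str.maketrans({",": "\u060C", ";": "\u061B", "?": "\u061F"})
JAWI_TO_RUMI_PUNCT = str.maketrans({"\u060C": ",", "\u061B": ";", "\u061F": "?"})


def normalize_rumi_word(word):
    """Downcase so that 'Kuning' matches the dictionary entry 'kuning'."""
    return word.lower()


def normalize_jawi_word(word):
    """Fold variant kaf and ga letters to the preferred Jawi letters."""
    return word.translate(JAWI_LETTER_MAP)


def convert_punctuation(text, to_jawi):
    """Swap commas, semicolons, and question marks for the target script."""
    table = RUMI_TO_JAWI_PUNCT if to_jawi else JAWI_TO_RUMI_PUNCT
    return text.translate(table)
```

Translation tables keep each rule to a single line, so adding another variant letter would only require one more entry.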

Step 3: Morphological Analysis

Malay has a robust morphological system of prefixes and suffixes which, for instance, change the root ajar ("teach"/"learn") to pelajar ("student"), ajaran ("precept"/"lesson"), pelajaran ("education"), belajar ("to learn"), mengajar ("to teach"), etc. Each of these words needs to be in the dictionary to be converted, but the system could be more robust if it could detect these affixes and convert them separately, because then it would only need to contain the roots and the affixes.

The converter does not yet do this level of sophisticated morphological analysis, but it does handle two common cases. If a word is not in the dictionary, ends in lah/له (a common discourse suffix), and exists in the dictionary without that suffix, then the word and the suffix are converted separately. For Jawi-to-Rumi conversion, words beginning with د (di) are similarly split and converted separately when the whole word is not in the dictionary. This is because the preposition di ("in"/"at") sometimes appears attached to words in Jawi where it would be a separate word in Rumi.
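The two fallback rules can be sketched as a function that returns the pieces to look up separately. This is a hypothetical illustration: the name analyze, the script labels, and the order in which the two rules are tried are assumptions, and the dictionary is represented only by its set of known words.

```python
# Suffix spellings per script; the Jawi form is lam + heh (له).
LAH = {"rumi": "lah", "jawi": "\u0644\u0647"}
DI_JAWI = "\u062F"  # د, the attached form of "di"


def analyze(word, dictionary, script):
    """Return the pieces of `word` that should be converted separately.

    `dictionary` is any container of known words in the source script.
    `script` is "rumi" or "jawi" (labels invented for this sketch).
    """
    if word in dictionary:
        return [word]
    suffix = LAH[script]
    stem = word[: -len(suffix)]
    if word.endswith(suffix) and stem in dictionary:
        return [stem, suffix]
    # The "di" rule applies only to Jawi input.
    if script == "jawi" and word.startswith(DI_JAWI) and word[1:] in dictionary:
        return [DI_JAWI, word[1:]]
    return [word]
```

For example, with a toy dictionary containing only "makan", analyze("makanlah", {"makan"}, "rumi") splits the word into "makan" and "lah", while an unknown word is returned unchanged as a single piece.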

Step 4: Dictionary Lookup

Once all the tokenization, normalization, and morphological analysis are complete, the actual dictionary lookup step is trivial: if the word is in the dictionary, the mapped form is used; if not, the original word is retained.
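A minimal sketch of this lookup, assuming the dictionary maps each word to a list of candidate forms. The names TOY_DICTIONARY, lookup, and convert_words are invented for this example, and using the first candidate as the default is an assumption, not something the converter is known to do.

```python
# Toy dictionary with a single entry: "dan" maps to دن and دان.
TOY_DICTIONARY = {
    "dan": ["\u062F\u0646", "\u062F\u0627\u0646"],
}


def lookup(word, dictionary):
    """Return the candidate conversions, or the word itself if unknown."""
    return dictionary.get(word, [word])


def convert_words(words, dictionary):
    """Convert already-tokenized, normalized words and join with spaces.

    Assumption: the first candidate is used as the default output.
    """
    return " ".join(lookup(word, dictionary)[0] for word in words)
```

Returning a one-element list for unknown words lets callers treat known and unknown words uniformly.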

Display Details

Aside from the algorithmic details of conversion, there are some additional technicalities in the way the results are presented.

Font

Jawi is an Arabic script but it uses characters not present in the Arabic language, so it is important to choose a font containing glyphs for these characters. This site uses Google's Noto Naskh Arabic font as it contains these glyphs and is a neutral typeface without many embellishments.

Writing direction

As Arabic scripts like Jawi are written from right to left and Latin scripts like Rumi are written from left to right, the input and output boxes of the converter are specified with these directionalities so the text displays appropriately. When the "Switch Direction" button is clicked, these directionalities are reversed.

Alternatives

Sometimes a word, whether in Rumi or Jawi, has multiple candidates for conversion. For instance, dan ("and") can be written both as دن and دان in Jawi, and سمبيلن can be either sembilan ("nine") or sambilan ("casual"). The converter tries to assist users in selecting the correct form by highlighting ambiguous conversions in blue. Clicking on a blue word shows the list of conversions at the top of the output box, as well as the original form. The user may then select the preferred conversion, which will be used in the output. Note that any change to the content of the input box, such as typing a character or switching the direction of conversion, clears all selections, so make selections only once the input is final.
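The selection behaviour can be modelled with a small state object. This is a hypothetical sketch, not the site's implementation: the class OutputState, its methods, and the first-candidate default are all assumptions. It shows why selections do not survive an input change: they are stored by word position, which is only meaningful for the current input.

```python
SEMBILAN_JAWI = "سمبيلن"  # ambiguous example word from the text above


class OutputState:
    """Tracks converted words and the user's choice among alternatives."""

    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.words = []
        self.selections = {}  # word position -> index of chosen candidate

    def set_input(self, words):
        # Any input change invalidates positions, so selections are cleared.
        self.words = list(words)
        self.selections.clear()

    def candidates(self, position):
        word = self.words[position]
        return self.dictionary.get(word, [word])

    def is_ambiguous(self, position):
        # Ambiguous words are the ones that would be highlighted.
        return len(self.candidates(position)) > 1

    def select(self, position, choice):
        if not 0 <= choice < len(self.candidates(position)):
            raise IndexError("no such candidate")
        self.selections[position] = choice

    def render(self):
        chosen = (
            self.candidates(i)[self.selections.get(i, 0)]
            for i in range(len(self.words))
        )
        return " ".join(chosen)
```

In this model, selecting the second candidate for سمبيلن changes the rendered output from sembilan to sambilan, and calling set_input again reverts to the default.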