The Rumi–Jawi converter is at its core a dictionary-based method: each Rumi or Jawi word is looked up in the dictionary and, if it exists, mapped to one or more forms in the other script. The results are then concatenated and written to the output box. The full process involves a few more steps, which are detailed below.
Step 1: Tokenization
A computer does not know what a word is, so, from the perspective of the converter, the input box only contains a sequence of characters. Tokenization is the process of finding word boundaries (or token boundaries, more generally) in this character sequence. Unlike, say, Chinese or Japanese, Malay is written in both Rumi and Jawi with spaces between words, and this makes the task considerably easier. There are still some challenges, however, such as punctuation. If the text were only ever split at spaces, the converter would treat "makan," (including the comma) as a word and fail to find an entry in the dictionary, whereas it would have found "makan" (without the comma).
In the converter, sequences of letter characters (Latin or Arabic) are grouped into tokens that get converted, while whitespace and most punctuation are ignored. Hyphens (-), however, are used frequently in both Rumi and Jawi for reduplication, and they are therefore included in word tokens. Commas, semicolons, and question marks appear differently in Rumi and Jawi and are therefore handled specially.
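The tokenization rule described above can be sketched in a few lines of Python. This is only an illustration, not the converter's actual code: the names WORD and tokenize are hypothetical, "letter" is approximated as any Unicode word character that is not a digit or underscore (which is broader than just Latin and Arabic), and passing non-word spans through unchanged is an assumption about what "ignored" means.

```python
import re

# Assumption: a word token is one or more runs of letters joined by single
# hyphens, so reduplicated forms such as "kupu-kupu" stay in one token.
# [^\W\d_] is a common idiom for "a Unicode letter" in Python's re module.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def tokenize(text):
    """Split text into (span, is_word) pairs that cover the whole input."""
    tokens = []
    pos = 0
    for match in WORD.finditer(text):
        if match.start() > pos:
            # Whitespace and punctuation between words: kept, not converted.
            tokens.append((text[pos:match.start()], False))
        tokens.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        tokens.append((text[pos:], False))
    return tokens

if __name__ == "__main__":
    for span, is_word in tokenize("Saya makan, kupu-kupu."):
        print(repr(span), is_word)
```

With this sketch, "makan," is split into the word "makan" and a separate non-word span containing the comma, so the dictionary lookup sees the bare word.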
Step 2: Normalization
Once words have been found by tokenization, they are passed to a conversion function depending on the direction of conversion (Rumi-to-Jawi or Jawi-to-Rumi), and the first step of conversion is normalization. This step reduces variation that would make conversion difficult.
In Rumi-to-Jawi conversion, the only word normalization that is done is downcasing. Computers see A and a as different letters, so if the dictionary contained the word kuning, a lookup of Kuning would fail without this step.
In Jawi-to-Rumi conversion there is no downcasing as Jawi does not have upper and lower case, but there are other reasons for variation. In Jawi, the preferred letter for /k/ sounds is ک rather than the Arabic kaf ك, but some documents nevertheless use the latter, so these are normalized to the former. Similarly, Jawi uses ݢ for the /g/ sound, but sometimes people use گ or ڬ instead, so these latter two get normalized to the first one.
For conversion in both directions, commas, semicolons, and question marks are replaced with the appropriate version.
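A rough sketch of these normalization rules follows. The function and table names are hypothetical, and the code points are the ones assumed to correspond to the letters named above; the Arabic comma, semicolon, and question mark are assumed to be the Jawi counterparts of the three punctuation marks.

```python
# Variant Jawi letters mapped to the preferred form.
JAWI_LETTERS = str.maketrans({
    "\u0643": "\u06A9",  # Arabic kaf ك -> preferred ک
    "\u06AF": "\u0762",  # گ -> preferred ݢ
    "\u06AC": "\u0762",  # ڬ -> preferred ݢ
})

# Punctuation that differs between the two scripts.
RUMI_TO_JAWI_PUNCT = str.maketrans({",": "\u060C", ";": "\u061B", "?": "\u061F"})
JAWI_TO_RUMI_PUNCT = str.maketrans({"\u060C": ",", "\u061B": ";", "\u061F": "?"})

def normalize_rumi_word(word):
    """Rumi-to-Jawi: downcase so that 'Kuning' matches the entry 'kuning'."""
    return word.lower()

def normalize_jawi_word(word):
    """Jawi-to-Rumi: replace variant letters with the preferred ones."""
    return word.translate(JAWI_LETTERS)

def convert_punctuation(text, to_jawi):
    """Swap commas, semicolons, and question marks for the target script."""
    table = RUMI_TO_JAWI_PUNCT if to_jawi else JAWI_TO_RUMI_PUNCT
    return text.translate(table)
```

Using translation tables keeps each rule a one-to-one character substitution, which matches the kind of normalization described here.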
Step 3: Morphological Analysis
Malay has a robust morphological system of prefixes and suffixes which, for instance, change the root ajar ("teach"/"learn") to pelajar ("student"), ajaran ("precept"/"lesson"), pelajaran ("education"), belajar ("to learn"), mengajar ("to teach"), etc. Each of these words needs to be in the dictionary to be converted, but the system could be more robust if it could detect these affixes and convert them separately, because then it would only need to contain the roots and the affixes.
The converter does not yet do this level of sophisticated morphological analysis, but it does handle two simple cases. First, if a word is not found in the dictionary, it ends in lah/له (a common discourse suffix), and the word without the suffix exists in the dictionary, then the word and the suffix are converted separately. Second, for Jawi-to-Rumi conversion, words beginning with د (di) are similarly converted separately when the whole word is not in the dictionary. This is because the adposition di ("in"/"at") sometimes appears attached to words in Jawi where it would be a separate word in Rumi.
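The two fallback rules might look something like the sketch below. It is not the converter's real implementation: the function names are hypothetical, each dictionary entry is simplified to a single target form, the order in which the two Jawi rules are tried is a guess, and emitting di as a separate word followed by a space is an assumption based on the description above. The toy dictionary entries are for illustration only.

```python
LAH_RUMI = "lah"
LAH_JAWI = "\u0644\u0647"  # له
DI_JAWI = "\u062F"         # د

def rumi_to_jawi(word, dictionary):
    """Look up a Rumi word, retrying without a trailing -lah on failure."""
    if word in dictionary:
        return dictionary[word]
    stem = word[:-len(LAH_RUMI)]
    if word.endswith(LAH_RUMI) and stem in dictionary:
        return dictionary[stem] + LAH_JAWI
    return word  # unknown word: keep the original

def jawi_to_rumi(word, dictionary):
    """Look up a Jawi word, retrying without the suffix or the di prefix."""
    if word in dictionary:
        return dictionary[word]
    stem = word[:-len(LAH_JAWI)]
    if word.endswith(LAH_JAWI) and stem in dictionary:
        return dictionary[stem] + LAH_RUMI
    rest = word[len(DI_JAWI):]
    if word.startswith(DI_JAWI) and rest in dictionary:
        # Assumption: di becomes its own word in the Rumi output.
        return "di " + dictionary[rest]
    return word

if __name__ == "__main__":
    makan = "\u0645\u0627\u06A9\u0646"  # ماکن
    rumah = "\u0631\u0648\u0645\u0647"  # رومه
    print(rumi_to_jawi("makanlah", {"makan": makan}))
    print(jawi_to_rumi(DI_JAWI + rumah, {rumah: "rumah"}))
```

Because the whole word is tried first, an entry that happens to end in lah or begin with د is never split when it already exists in the dictionary.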
Step 4: Dictionary Lookup
Once all the tokenization, normalization, and morphological analysis are complete, the actual dictionary lookup step is trivial: if the word is in the dictionary, the mapped form is used; if not, the original word is retained.
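As a minimal sketch of this final step, the lookup and the concatenation of results could be written as follows. The names are hypothetical, tokens are assumed to be (text, is_word) pairs, and always picking the first form when an entry lists several is an assumption made only to keep the example short.

```python
def lookup(word, dictionary):
    """Return the mapped forms for a word, or the word itself if unknown."""
    return dictionary.get(word, [word])

def convert_tokens(tokens, dictionary):
    """Convert word tokens and join everything back into one string."""
    output = []
    for text, is_word in tokens:
        if is_word:
            # Assumption: the first listed form is the one that is shown.
            output.append(lookup(text, dictionary)[0])
        else:
            output.append(text)  # whitespace and punctuation pass through
    return "".join(output)

if __name__ == "__main__":
    toy_dictionary = {"makan": ["\u0645\u0627\u06A9\u0646"]}  # ماکن
    tokens = [("makan", True), (" ", False), ("xyz", True)]
    print(convert_tokens(tokens, toy_dictionary))
```

Here the unknown token "xyz" is kept as typed, which mirrors the rule that words missing from the dictionary are retained unchanged.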