It probably won’t come as a surprise to anyone that the most popular data editing tools, such as MS Excel or LibreOffice Calc, are not always the best choice for preparing annotated data. We would therefore like to say a few words about Sublime Text Editor, which in our practice has proved very useful for editing information extracted from documents in Arabic.
- What turned out to be problematic in the Arabic data?
- And why is Sublime Text Editor the one we decided to use?
These questions are answered below. But first things first...
I’ll let you guess: “Arabic is an RTL language and that’s the problem?”
The answer is: yes and no. Arabic is a Semitic language written in the Arabic script, which is an abjad (a primarily consonantal writing system) and runs horizontally from right to left. Arabic is one of the most widely used languages written this way, if not the most widely used. Others include Hebrew, Pashto and Persian.
Another distinctive feature of the Arabic writing system is that the letters vary in shape depending on their position within a word (initial, medial, final or isolated), which is sometimes referred to as the IMFI writing system.
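To make the four positional shapes concrete, here is a minimal Python sketch. It relies only on the standard `unicodedata` module and on the fact that Unicode encodes the positional variants of the letter beh as separate compatibility code points (U+FE8F to U+FE92). The helper name `describe_forms` is ours, chosen for illustration. In ordinary text only the base letter (U+0628) is stored, and the rendering engine picks the right shape.

```python
import unicodedata

# Compatibility code points for the four positional shapes of the
# Arabic letter BEH (base letter: U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def describe_forms(chars):
    """Return (code point, Unicode name, base code point) for each character."""
    rows = []
    for ch in chars:
        name = unicodedata.name(ch)
        # NFKC normalization folds a presentation form back to its base letter.
        base = unicodedata.normalize("NFKC", ch)
        rows.append((f"U+{ord(ch):04X}", name, f"U+{ord(base):04X}"))
    return rows


for row in describe_forms(BEH_FORMS):
    print(*row, sep="  ")
```

All four entries normalize to the same base letter, which is why data copied from PDFs or legacy systems may look identical on screen yet fail to match in a search until it is normalized.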
But perhaps the most unexpected thing for us about editing Arabic data was that almost every document we analyzed contained data written in both directions. Not only are the Indic-Arabic numerals written, as a rule, from left to right, but the Arabic numerals used with the Latin script (e.g. 1, 2, 3...) may also appear next to Arabic words, running in the opposite direction to the surrounding text. The names Indic-Arabic vs. Arabic are a bit confusing, I agree. In fact, we are talking about two systems of numerals: Indic-Arabic (the so-called Eastern Arabic numerals), used with the Arabic script, and Western Arabic numerals, used with the Latin script and sometimes with the Arabic script as well. What’s more, a financial report written in Arabic may contain some elements written in the Latin alphabet (e.g. some proper names) – see Fig. 1 below.
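The mix of directions described above can be inspected programmatically. The sketch below is only an illustration, not part of our annotation workflow: it uses Python's standard `unicodedata.bidirectional` to label each run of characters with its Unicode bidirectional class. The sample string and the helper name `bidi_runs` are made up for this example.

```python
import unicodedata


def bidi_runs(text):
    """Group consecutive characters that share a Unicode bidi class."""
    runs = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if runs and runs[-1][0] == cls:
            runs[-1][1] += ch
        else:
            runs.append([cls, ch])
    return [(cls, chunk) for cls, chunk in runs]


# Hypothetical sample: an Arabic word, Eastern Arabic digits (one, two, three),
# the Arabic conjunction "wa", Western digits and a Latin abbreviation.
SAMPLE = "\u0627\u0644\u0633\u0639\u0631 \u0661\u0662\u0663 \u0648 456 USD"

for cls, chunk in bidi_runs(SAMPLE):
    print(cls, repr(chunk))

# Both numeral systems are decimal digits as far as Python is concerned.
print(int("\u0661\u0662\u0663"), int("456"))
```

The classes that matter here are `AL` (Arabic letter, right to left), `AN` (Arabic number, used for the Eastern Arabic digits), `EN` (European number, used for the Western digits) and `L` (left to right, used for Latin letters). An editor has to reconcile all of them within a single line.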