November 16, 2021

Complex Bi-Directional Text Data and Tackling it with the Sublime Editor

It probably won’t come as a surprise to anyone if we say that the most popular tools for editing tabular data, such as MS Excel or LibreOffice Calc, are not always the best choice for preparing annotated data. We would therefore like to say a few words about the Sublime Text editor, which in our practice has proved very useful for editing information extracted from documents in Arabic.

  • What turned out to be problematic in the Arabic data? 
  • And why is Sublime Text Editor the one we decided to use? 

These questions are answered below. But first things first... 


I’ll let you guess: “Arabic is an RTL language and that’s the problem?”

The answer is: yes and no. Arabic is a Semitic language written in the Arabic script, an abjad (that is, a primarily consonantal writing system) that runs horizontally from right to left. Arabic is one of the most widely used, if not the most widely used, of the languages written this way. Others include, for example, Hebrew, Pashto, and Persian.

Another distinctive feature of the Arabic writing system is that letters change shape depending on their position within a word (initial, medial, final, or isolated), which is sometimes referred to as the IMFI writing system.
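To see how these positional shapes can surface in extracted data, here is a minimal Python sketch. It assumes that some sources store letters as Unicode "presentation forms" (one code point per shape) rather than as the base letter; the helper name `positional_form` is ours, chosen purely for illustration.

```python
import unicodedata

# Arabic letter BEH (U+0628) and its four Unicode presentation forms.
BEH = "\u0628"
FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def positional_form(ch):
    """Return (position tag, base letter) taken from the Unicode
    compatibility decomposition of a presentation-form character."""
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))


for label, ch in FORMS.items():
    # Each presentation form decomposes back to the same base letter.
    print(label, positional_form(ch), unicodedata.name(ch))

# NFKC normalization folds every shape back to the base letter, which
# helps when comparing text copied from different sources.
normalized = {unicodedata.normalize("NFKC", ch) for ch in FORMS.values()}
```

Normalizing to the base letters before comparing or tagging strings is one possible design choice; whether it is appropriate depends on how the source documents encode their text.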

But perhaps the most unexpected thing about editing Arabic data for us was that almost every document we analyzed contained text written in both directions. Numbers are written from left to right even inside right-to-left Arabic text, and this holds for both numeral systems we encountered: Eastern Arabic numerals (also called Indic-Arabic: ٠, ١, ٢, ٣...), used with the Arabic script, and Western Arabic numerals (1, 2, 3...), used with the Latin script and sometimes with the Arabic script as well. The names are a bit confusing, I agree. What’s more, a financial report written in Arabic may contain elements written in the Latin alphabet, such as proper names (see Fig. 1 below).

Fig.1. Left: the name of the organization given in both Latin and Arabic characters (excerpt from the 2018 Dhofar annual report). Right: Amounts in the table are given in Western Arabic numerals (excerpt from the 2015 annual report of the Arab Fund for Economic & Social Development).
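The mix of directions and numeral systems described above can be inspected programmatically. The following Python sketch uses the standard `unicodedata` module to show the Unicode bidirectional class of each kind of character and to map Eastern Arabic digits to Western ones. The helper names `bidi_class` and `to_western_digits` are illustrative and not part of any tool mentioned in this post.

```python
import unicodedata

# Translation table from Eastern Arabic digits (U+0660..U+0669) to 0-9.
EASTERN_TO_WESTERN = {0x0660 + i: str(i) for i in range(10)}


def bidi_class(ch):
    """Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


def to_western_digits(text):
    """Replace Eastern Arabic digits with their Western counterparts."""
    return text.translate(EASTERN_TO_WESTERN)


samples = {
    "Western Arabic digit": "3",
    "Eastern Arabic digit": "\u0663",  # ٣
    "Arabic letter": "\u062F",         # د
    "Latin letter": "D",
}
for label, ch in samples.items():
    # EN = European number, AN = Arabic number,
    # AL = Arabic letter (right to left), L = left to right.
    print(f"{label}: {bidi_class(ch)}")

eastern_31 = "\u0663\u0661"  # ٣١
print(to_western_digits(eastern_31), int(eastern_31))
```

Python's `int()` already accepts Eastern Arabic digits, so the explicit translation is mainly useful when the annotated output should use a single numeral system.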


Which text editor did we choose to create a bidirectional .csv file… and why Sublime?

There are few people in the world who can boast great working knowledge of various text editors. I am not one of them. Therefore, I cannot say that Sublime is the only text editor worth recommending for working with Arabic data. We chose it because it turned out to be a very simple solution to the problems caused by the bi-directional nature of our documents.

So how do you use Sublime?

If you already have Sublime on your computer, download a package dedicated to bi-directional text, e.g. Sublime-Text-2-BIDI-master.zip, which you can find, for example, here: https://github.com/praveenvijayan/Sublime-Text-2-BIDI. Once the package is on your computer, unpack it, open the Sublime editor, click Preferences, select Browse Packages, and copy the unpacked package into the folder that opens. And that’s basically all. Now let’s see how it works.

Suppose we are annotating annual reports from which we want to obtain the dates of financial settlements. And let’s suppose that the date that interests us is expressed in words, as in Fig. 2.

Fig.2. Excerpt from the table with the financial settlement in which a part of the date (e.g.: 31st of December) is expressed in words.

Sometimes it’s easier to copy and paste the word or words we need to tag, especially when we work with languages we barely know or do not know at all. However, as simple as it may seem, it doesn’t always work out as intended. Original document formats vary, and in many cases copying Arabic text does not work because the letters lose their correct order. The result is a string that means nothing in Arabic: the letters on the left in Fig. 3 are written in the opposite direction to those on the right, which do in fact mean ‘December’. If you now right-click and choose Bidirectional selection, the pasted text (on the left) takes the right-to-left direction (compare the image on the right below).

Fig.3. Left: the word meaning ‘December’ directly after being copied from a document and entered into the Sublime editor. Right: the same word after applying the Bidirectional selection function.
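The reversed-letter problem can also be reproduced and undone outside the editor. The Python sketch below is not how the Sublime package works internally (we have not inspected it); it only assumes that the copied text arrived with each Arabic word stored in reverse order, and it ignores diacritics, which would need extra care. The function name `reverse_arabic_runs` is ours.

```python
import re

# 'December' in Arabic, in correct logical order: د ي س م ب ر
DECEMBER = "\u062F\u064A\u0633\u0645\u0628\u0631"

# Runs of Arabic letters only (U+0621..U+064A). Digits and Latin
# characters are left out on purpose because they run left to right.
ARABIC_RUN = re.compile(r"[\u0621-\u064A]+")


def reverse_arabic_runs(text):
    """Reverse every run of Arabic letters, leaving the rest untouched."""
    return ARABIC_RUN.sub(lambda m: m.group(0)[::-1], text)


# Simulate a bad copy-paste in which the letters arrive in reverse order.
broken = "31 " + DECEMBER[::-1] + " 2018"
fixed = reverse_arabic_runs(broken)
print(fixed == "31 " + DECEMBER + " 2018")
```

Reversing word by word keeps numbers and Latin names in place, which matters in mixed-direction lines like the ones described earlier. Applying the function to text that is already correct would break it, so it should only be run on strings known to be reversed.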


Now let’s look at a file that contains various data points: words written in Arabic script from right to left; dates and amounts written in Eastern Arabic or Western Arabic numerals (yes, the same ones you know perfectly well from the Latin script), both written from left to right; and words written in Latin script. As you can see, the two text directions coexist, and I assure you that the file closes and opens correctly, without unexpected changes to its formatting.

Fig.4. A file containing bidirectional data.
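A quick way to check that such a file survives being saved and reopened is a round trip through Python's `csv` module with an explicit UTF-8 encoding. This is only a sketch: the column names and the sample row are made up for illustration and do not come from our annotated data.

```python
import csv
import os
import tempfile

# Made-up sample rows mixing Latin text, Arabic text and both numeral systems.
ROWS = [
    ["name", "month", "day", "year"],
    ["Dhofar", "\u062F\u064A\u0633\u0645\u0628\u0631", "\u0663\u0661", "2018"],
]


def round_trip(rows):
    """Write rows to a temporary UTF-8 .csv file and read them back."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
        with open(path, encoding="utf-8", newline="") as f:
            return list(csv.reader(f))
    finally:
        os.remove(path)


restored = round_trip(ROWS)
print(restored == ROWS)
```

The file stores characters in logical order regardless of direction, so how each cell is displayed is decided by the editor or viewer, not by the .csv file itself.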

Conclusion

What conclusions can we draw from this experience? Well, first of all, working with foreign languages and with different writing systems is almost as interesting and full of surprises as working with different text editors. Arabic writing is fascinating, and we at Applica are very curious about what other situations we may encounter when working with it in the future. I’m certain that we will learn a lot more as we continue to work with Arabic documents. Sublime proved to be a very easy-to-use tool for editing bi-directional texts, but other editors may well prove equally useful for us in the future.
