On September 30 we will be celebrating the 36th anniversary of (the love it or hate it) Microsoft Excel spreadsheet. When it hit the commercial market two years after its creation, Excel quickly became the de facto standard for data storage and processing. Microsoft enjoyed such success with Excel that the competition imitated it almost 100%. With the rapid development of computer and digital technology, better methods of data storage (database systems such as SQL) and processing (scripted computer languages such as Python and R) have since emerged. However, due to the convenience of use and low entry threshold, Excel spreadsheets still dominate in business and non-computer science environments. Thus, a conflict occurs at a company such as Applica where we automate business documents via AI and progressive neural language models.
When a client comes to us they typically have their data in Excel, while we need the data in a much simpler and more convenient CSV format for processing through our automation solution. The problem seems to be a bit trivial, because Excel or LibreOffice Calc (Excel’s younger sister) have the ability to easily save a spreadsheet to a CSV file. And everything would be perfect if it wasn’t for the autocorrect feature built into these programs. At Applica, we often find that the customer’s data was uploaded incorrectly, simply because it was transformed by Excel autocorrect or had the wrong coding set. As a result, many hours are often spent improving or reconstructing the cells. Suffice it to say that the research of Marek Ziemann, Megan Soria, Mary Sravya Kasu, and Mandhri Abeysooriya showed that after surveying “more than 10,000 papers with Excel gene lists published between 2014 and 2020, they found more than 30% contained at least one gene name mangled by autocorrect”.
Avoid the Errors
Therefore, here is some advice that can help you avoid these types of problems going forward.
1. The file for Applica must be UTF-8 formatted without a BOM. UTF-8 is simply one of the most common encoding systems that is used fairly widely in computer science. While the BOM is a byte order mark, which may not be used in systems where the encoding method is predetermined (e.g.: you should use UTF-8, as anything else is asking for trouble), and because it always takes a few bytes, it is better to get rid of it.
2. To avoid autoformat values in Excel or LibreOffice Calc, be sure to set the cell formatting to text. When saving the file to CSV format, check “Quote all text cells”. It is also a good practice to verify the finished annotation file in a simple text program, e.g.: Notepad ++, Sublime, or even an ordinary Notepad.
3. If you want to open a CSV file with annotations in Excel, it is best to open a blank sheet first and then import the text data. During one of the import steps, you can set the cell formatting to text before loading the data (see Fig. 2).
The next four points may not be important for the correct reading of the data, but we apply them in our company, because they guarantee greater data transparency:
4. Individual columns must be separated by semicolons (;). In theory, as the format name implies (e.g.: .csv for comma-separated values), they could be commas, but semicolons are less used in everyday language, so they’re just more convenient. Tab would be even better, but then we would not be dealing with a CSV file but a TSV ;).
5. The names of the columns are also the names of the data points (e.g.: telephone, amount_paid, employee_name). They must be unique.
6. If there are data points that can have many different values, each of these values should be in a separate column, and the names of these columns should be different by adding a sequential number, e.g.: ‘telephone_1’,’ telephone_2’, ‘telephone_3’, etc.
7. In the case of data points to be grouped, the number added to the data point should be common to all data points in the group.
Therefore, the question arises: will following these pointers ensure that your data will always be loaded correctly? Well, only Technical Boy can know that for sure. But there is an old saying that if something is for everyone, it’s actually for no one, and perhaps that’s the same with autocorrect. However, following our recommendations will surely eliminate the most common problems with handling CSV files. But if you take nothing else from this article, always remember:
1 Ziemann M, Abeysooriya M, Soria M, Kasu M, 2021, Excel autocorrect errors still plague genetic research, raising concerns over scientific rigour, „The Conversation”, access: 2021-09-17.