January 3, 2022

Do Not Unpack Packed Text Files!

If you are doing modern document processing, you probably work with large text files, for example Wikipedia dumps (you know it’s stupid to crawl Wikipedia when they provide you with a nice dump, right?) or CommonCrawl dumps (say, the dump of the 2012 French Internet prepared by the people behind statmt.org). At Applica, we don’t stop there: we combine one-dimensional text processing with two-dimensional document processing and computer vision. Still, it’s good to have large resources of purely textual data for neural language modeling.

I’m a Simple Person. I See a Compressed File, I Click It to Decompress.

A common mistake is to unpack such large text files. Don’t do that! You would waste both time and disk space. Plain text compresses quite well: for example, the Wikipedia dump takes 80 GB decompressed and only 18 GB compressed. Rather than unpacking the file, the better option is to decompress its content on the fly every time you need it.

Won’t Uncompressing Each Time Slow You Down?

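It may seem that decompressing once would save CPU cycles and time. In practice, a CPU is faster than your disk (especially a magnetic one), so reading a much smaller compressed file and decompressing it on the fly often costs no more time than reading the full uncompressed file.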
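To make the disk side of this argument concrete, here is a back-of-the-envelope sketch using the 80 GB and 18 GB sizes quoted above. The 150 MB/s read speed is purely an illustrative assumption, not a measurement, and the helper name read_seconds is made up for this example:

```python
def read_seconds(size_gb, disk_mb_per_s):
    """Seconds needed to read size_gb gigabytes sequentially (decimal units)."""
    return size_gb * 1000 / disk_mb_per_s


# ASSUMPTION: 150 MB/s is an illustrative sequential read speed, not a measurement.
DISK_MB_PER_S = 150

uncompressed = read_seconds(80, DISK_MB_PER_S)  # 80 GB decompressed dump
compressed = read_seconds(18, DISK_MB_PER_S)    # 18 GB compressed dump
print(f"uncompressed: {uncompressed:.0f} s, compressed: {compressed:.0f} s")
```

Under that assumed speed, reading the compressed file leaves the decompressor several minutes of slack before on-the-fly processing becomes slower than reading the unpacked file.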

Ok, But I Need to Process the Decompressed Contents…How?!

One option is to use command-line tools such as zcat, bzcat or xzcat (for the gzip, bz2 and xz compressors, respectively) and combine them with other commands into a pipeline. Say you’d like to count how many lines of Wikipedia mention Marian Rejewski. This simple task can be carried out with a pipeline along the lines of bzcat DUMP.bz2 | grep 'Marian Rejewski' | wc -l, where DUMP.bz2 stands for the compressed dump file.

No decompressed copy of the file is written to disk. Instead, its content is decompressed on the fly and streamed as input to the grep command, and finally wc -l counts the matching lines.

How to Do It In Pure Python?

What if you need to do whatever you’re supposed to do in Python? You might try to combine a Python script reading from a standard input with zcat/bzcat/xzcat:
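A minimal sketch of such a stdin-reading script, assuming the task is still counting the lines that mention Marian Rejewski. The function name, the script name count_lines.py and the dump filename are placeholders:

```python
import sys


def count_matching_lines(lines, phrase="Marian Rejewski"):
    """Count the lines that contain the given phrase."""
    return sum(1 for line in lines if phrase in line)


if __name__ == "__main__":
    # sys.stdin is consumed lazily, line by line, so memory use stays flat
    # no matter how large the decompressed stream is.
    print(count_matching_lines(sys.stdin))
```

It could be run as bzcat DUMP.bz2 | python3 count_lines.py, with DUMP.bz2 standing for the compressed dump file.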

This is actually not a bad idea, since the shell runs the decompression (bzcat) and your script in parallel on two separate CPU cores. If you want a pure Python solution, though, you should use the bz2 module (or gzip and lzma for gzip and xz, respectively):
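A minimal sketch of the pure Python approach, again counting lines that mention Marian Rejewski. The function name and the UTF-8 encoding are assumptions made for this example:

```python
import bz2


def count_matching_lines(path, phrase="Marian Rejewski"):
    """Count lines containing `phrase` in a bz2-compressed text file."""
    count = 0
    # Mode "rt" decompresses on the fly and decodes bytes to str,
    # so the loop sees ordinary text lines and nothing is unpacked to disk.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if phrase in line:
                count += 1
    return count
```

For gzip and xz files, gzip.open and lzma.open accept the same mode and encoding arguments, so only the module name needs to change.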

Which Compressor to Choose?

If you read from a compressed file provided by somebody else, you don’t have any choice. But what if you have a large text file and want to compress it, which compressor is the best? (I hope I have already convinced you it is a good idea to store text files compressed and process them in this format.)

My personal choice is xz. It achieves a high compression ratio; compression is admittedly quite slow, but decompression is still very swift, which suits a scenario where you compress once and decompress many times on the fly. Let’s do some simple experiments on the Wikipedia dump. I focused on command-line compressors for single files and didn’t consider popular archivers such as ZIP or RAR, which can pack more than one file.
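A comparison of this kind can be tried on a small scale from Python. The sketch below is not how the measurements discussed here were taken: it uses the standard-library gzip, bz2 and lzma modules on an in-memory sample rather than the command-line tools on the full dump, and it leaves zstd out to stay within long-established standard-library modules. The function name and the sample text are made up for this example:

```python
import bz2
import gzip
import lzma
import time


def benchmark(data):
    """Return {name: (compressed_size, compress_s, decompress_s)} for bytes `data`."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        decompress_s = time.perf_counter() - start

        assert unpacked == data  # sanity check: the round trip must be lossless
        results[name] = (len(packed), compress_s, decompress_s)
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive text standing in for a real corpus.
    sample = ("Marian Rejewski was a Polish mathematician.\n" * 20000).encode("utf-8")
    for name, (size, c_s, d_s) in benchmark(sample).items():
        print(f"{name:5s} {size:8d} bytes  compress {c_s:.3f}s  decompress {d_s:.3f}s")
```

Numbers from such a tiny, repetitive sample say little about a real corpus, so for a serious comparison the sample should be replaced with a representative slice of your own data.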

Time measurements should be taken with a grain of salt, as it is very hard to control conditions such as disk caching, but it is still noticeable that decompression times for gzip and xz are quite good. What was a little bit surprising for me was the slowness of xz compression. I had expected a big number here, but not such a big one! I also added zstd, a relatively new contender, which is incredibly fast as far as both compression and decompression are concerned. Still, the Scrooge in me would miss the 5 GB size difference between xz and zstd.
