If you are doing modern document processing, you probably operate on large text files, for example Wikipedia dumps (you know it’s stupid to crawl Wikipedia when they provide you with a nice dump, right?) or CommonCrawl dumps (say, the dump of the 2012 French Internet prepared by the people from statmt.org). At Applica, we don’t stop there: we combine one-dimensional text processing with two-dimensional document processing and computer vision. Still, it’s good to have large resources of purely textual data for neural language modeling.
I’m a Simple Person. I See a Compressed File, I Click It to Decompress.
A mistake many people make is to unpack such large text files. Don’t do that! You would waste both time and disk space. Plain text compresses quite well: for example, the Wikipedia dump takes 80 GB decompressed but only 18 GB compressed. The better option, rather than unpacking the file, is to extract its content every time you need it.
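For instance, you can peek at a compressed dump without ever materializing the decompressed file. A minimal sketch, assuming a bzip2-compressed dump (the file name is only an illustration; use zcat or xzcat instead of bzcat if your dump is gzip- or xz-compressed):

    # stream the first 20 decompressed lines without unpacking anything to disk
    bzcat enwiki-latest-pages-articles.xml.bz2 | head -n 20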
Won’t Uncompressing Each Time Slow You Down?
It might seem that extracting once and for all should save CPU cycles and time. Actually, it usually doesn’t: a CPU can decompress the data faster than your disk (especially a magnetic one) can read the much larger decompressed file, so decompressing on the fly tends to win.
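If you want to check this on your own hardware, you can compare how long it takes to read the decompressed file with how long it takes to decompress on the fly. The file names below are hypothetical; run each command a couple of times (or drop the page cache) for a fair comparison:

    # raw disk read of the decompressed dump (~80 GB has to come off the disk)
    time cat enwiki-latest-pages-articles.xml > /dev/null
    # decompression on the fly (only ~18 GB read from disk, the rest is CPU work)
    time bzcat enwiki-latest-pages-articles.xml.bz2 > /dev/null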
OK, But I Need to Process the Decompressed Contents… How?!
One option is to use command-line tools such as zcat, bzcat or xzcat (for gzip, bzip2 and xz compression, respectively) and combine them with other commands/programs to form a pipeline. Let’s say you’d like to count how many times Marian Rejewski occurs in Wikipedia; this simple task can be carried out with a pipeline along the following lines:
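A minimal sketch, assuming the dump is the bzip2-compressed enwiki-latest-pages-articles.xml.bz2 (substitute whatever your dump happens to be called):

    # decompress on the fly, print each occurrence of the phrase on its own line, count the lines
    bzcat enwiki-latest-pages-articles.xml.bz2 | grep -o 'Marian Rejewski' | wc -l

grep -o prints every match on a separate line, so wc -l counts occurrences rather than matching lines; with a gzip- or xz-compressed dump you would simply swap bzcat for zcat or xzcat.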