How to leverage multi-core CPUs to speed up your Linux commands — awk, sed, bzip2, etc.

by anonymous on 2013-11-16 17:32:21

Use multiple CPU Cores with your Linux commands

Have you ever needed to process a very large dataset (hundreds of GB), grep through it, or run some other operation that these tools won't parallelize on their own? Data specialists, I'm talking to you. You may have a CPU with 4 or more cores, but standard tools like grep, bzip2, wc, awk, and sed are single-threaded and will only ever use one of them.

In the words of cartoon character Cartman, "How do I use these cores?"

To make Linux commands use all your CPU cores, we need GNU Parallel. It lets us run map-reduce-style operations across all the cores of a single machine, with the help of the often-overlooked --pipe parameter (also known as --spreadstdin). With it, the load is spread evenly across all your CPUs.
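
If you don't have GNU Parallel yet, it is usually a single package away; the commands below are a sketch for Debian-style systems (use your own distro's package manager otherwise), and nproc simply reports how many cores Parallel will have to play with:

sudo apt-get install parallel   # Debian/Ubuntu; the package is simply called "parallel"
nproc                           # prints the number of available CPU cores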

BZIP2

Bzip2 is a better compression tool than gzip, but it's slow! Don't worry; we have a solution for that.

Old method:

cat bigfile.bin | bzip2 --best > compressedfile.bz2

New method:

cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

For bzip2 in particular, GNU Parallel makes a huge difference on multi-core CPUs; before you know it, the job is done.
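
Curious how much it helps on your own hardware? A rough timing comparison is easy to run; bigfile.bin is the same stand-in file as above, and writing to /dev/null keeps disk writes out of the measurement:

time bzip2 --best < bigfile.bin > /dev/null
time parallel --pipe --recend '' -k bzip2 --best < bigfile.bin > /dev/null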

GREP

If you have an extremely large text file, you might have previously done this:

grep pattern bigfile.txt

Now you can do this:

cat bigfile.txt | parallel --pipe grep 'pattern'

Or this:

cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'

The second form uses the --block 10M parameter, which hands each grep job a chunk of roughly 10 MB of input (not 10 million lines); you can tune how much data each CPU core chews on at a time by adjusting this value.
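
The right block size depends on your data and your disks, so treat it as a knob to experiment with: larger blocks mean fewer, heavier jobs, while smaller blocks mean more scheduling overhead. For example (100M here is just an arbitrary value to try, not a recommendation):

cat bigfile.txt | parallel --block 100M --pipe grep 'pattern'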

AWK

Here's an example of using awk to sum up the numbers in a very large data file:

Standard usage:

cat rands20M.txt | awk '{s+=$1} END {print s}'

New method:

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

This looks a bit complex: the --pipe parameter splits the output of cat into chunks and feeds each chunk to its own awk invocation, each of which produces a partial sum. These partial sums are then piped into one final awk that adds them up. The backslashes in the first awk are there to protect the single quotes and the $ from your own shell, so they survive intact until GNU Parallel rebuilds the command for each job.
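
If the backslash dance feels fragile, GNU Parallel's -q (--quote) option quotes the command for you, so the awk program can be written the normal way. A sketch of the same map-reduce, assuming your version of Parallel supports -q:

cat rands20M.txt | parallel --pipe -q awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'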

WC

Want the fastest way to count the number of lines in a file?

Traditional method:

wc -l bigfile.txt

New method:

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

Very clever: parallel 'maps' the input across many wc -l calls, each of which emits a partial line count, and the final awk 'reduces' those counts into a single total.
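
To see the 'map' half by itself, drop the final awk and look at what comes out of the parallel step: one partial count per chunk, which is exactly what the aggregation step sums up.

cat bigfile.txt | parallel --pipe wc -l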

SED

Want to perform a lot of substitutions using sed on a huge file?

Standard method:

sed s^old^new^g bigfile.txt

New method:

cat bigfile.txt | parallel --pipe sed s^old^new^g

...and then you can redirect the output into a file of your choice.
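
One caveat: with --pipe the chunks finish in whatever order the cores happen to get done, so if the result's line order matters, add -k (--keep-order) before redirecting to a file. newfile.txt below is just a placeholder name:

cat bigfile.txt | parallel --pipe -k sed s^old^new^g > newfile.txt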