GNU Parallel Note


Introduction

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.

If you use xargs and tee today, you will find GNU parallel very easy to use, as it is written to have the same options as xargs. If you write loops in shell, GNU parallel may be able to replace most of them and make them run faster by running several jobs in parallel.
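As a minimal sketch of replacing a shell loop, the snippet below counts lines in two files, first sequentially and then with GNU parallel. It assumes GNU parallel is installed; the files a.txt and b.txt are hypothetical and created in a temporary directory only for illustration.

```shell
# Hypothetical scratch files, created only for this illustration.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n' > "$dir/b.txt"

# Sequential loop: one wc process at a time.
loop_out=$(for f in "$dir"/*.txt; do wc -l < "$f"; done)

# Same work with GNU parallel; -k keeps the output in argument order.
par_out=$(parallel -k 'wc -l < {}' ::: "$dir"/*.txt)

echo "$loop_out"
echo "$par_out"
rm -rf "$dir"
```

Both variables should hold the same text. With only two tiny files the speed-up is negligible; the point is the shape of the rewrite.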

GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.

A simple example

parallel echo {}\; cat {} ::: *.txt

Reading arguments from stdin

ls *.csv | parallel -j 4 wc -l

Redirection

parallel zcat {} ">" {.} ::: *.gz

Multiple input sources

parallel do_something {1} {2} :::: xlist ylist | process_output
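To see how {1} and {2} are filled in, here is a small sketch that swaps do_something for echo. It assumes GNU parallel is installed; the files xlist and ylist are hypothetical and written to a temporary directory. With two input sources, every combination of values is generated.

```shell
# Two input sources given inline with ::: ; every combination is generated.
combos=$(parallel -k echo {1}-{2} ::: A B ::: 1 2)
echo "$combos"

# The same sources read from files with :::: (hypothetical xlist and ylist).
dir=$(mktemp -d)
printf 'A\nB\n' > "$dir/xlist"
printf '1\n2\n' > "$dir/ylist"
from_files=$(parallel -k echo {1}-{2} :::: "$dir/xlist" "$dir/ylist")
echo "$from_files"
rm -rf "$dir"
```

Both forms print A-1, A-2, B-1 and B-2, one per line. The -k flag is only there to make the order predictable.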

Replacement strings

{}   original
{.}  removes the extension ({..} removes two extensions, requires --plus)
{/}  removes the path (like basename)
{//} keeps only the path (like dirname)
{/.} removes the path and the extension
{#}  gives the job number
{%}  gives the job slot number
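A quick way to check these is to echo them all for a single argument. This sketch assumes GNU parallel is installed; dir/report.csv is a hypothetical path used only as a string, so the file does not need to exist.

```shell
# Hypothetical path used only as a string; the file need not exist.
out=$(parallel -k echo '{} {.} {/} {//} {/.} {#}' ::: dir/report.csv)
echo "$out"
# prints: dir/report.csv dir/report report.csv dir report 1
```

{%} is left out here because the slot number depends on how many jobs are running at once.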

Quoting

# parallel perl -e 'print "@ARGV\n"' ::: This wont work
parallel -q perl -e 'print "@ARGV\n"' ::: This works

Dry run

parallel --dryrun echo {} ::: A B C

Avoid mixing half a line from one job with half a line from another job

parallel --linebuffer

To force the output into the same order as the arguments, use --keep-order/-k

parallel -k
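A small sketch of the effect, assuming GNU parallel is installed: the first job sleeps longer than the second, so it finishes last, yet with -k its output is still printed first.

```shell
# Two jobs run at once (-j2); the first argument sleeps longer than the second.
# With -k, output follows argument order rather than completion order.
ordered=$(parallel -j2 -k 'sleep {}; echo {}' ::: 2 1)
echo "$ordered"
```

Without -k the same command would normally print 1 before 2, since the shorter job completes first.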

Tagging

$ ls *.csv | parallel --tag cat
1.csv line1
1.csv line2
2.csv line1

GNU parallel vs xargs

  • xargs deals badly with special characters (such as space, ' and ").
  • xargs can run a given number of jobs in parallel, but has no support for running number-of-cpu-cores jobs in parallel.
  • xargs has no support for grouping the output, therefore output may run together, e.g. the first half of a line is from one process and the last half of the line is from another process.
  • xargs has no support for keeping the order of the output, therefore if running jobs in parallel using xargs the output of the second job cannot be postponed till the first job is done.
  • xargs has no support for running jobs on remote computers.
  • xargs has no support for context replace, so you will have to create the arguments.
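The first point in the list above can be illustrated with one input line that contains a space. This is a sketch assuming both xargs and GNU parallel are installed; my file.txt is a hypothetical name and no file is touched.

```shell
# One input line containing a space (a hypothetical file name).
# xargs splits it on whitespace into two arguments.
xargs_out=$(printf 'my file.txt\n' | xargs -n 1 echo)

# GNU parallel treats each input line as one argument.
parallel_out=$(printf 'my file.txt\n' | parallel echo)

echo "$xargs_out"
echo "$parallel_out"
```

Here xargs runs echo twice (once for my, once for file.txt), while GNU parallel runs it once with the whole line. xargs can be made to behave with -0 or -d, but that has to be arranged by the caller.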
