Selection of some useful bash tools
Wed 09 January 2019 by Franz Kirsten

- Sometimes it is quicker to write a short bash script to 'get the job done' instead of doing it in Python or something else
- bash comes with a load of very useful little programmes that allow you to modify files on the fly
- if you 'pipe' things together (with the `|` operator), i.e. send the output of one programme as input to the next, you can create a complex chain of tasks as a one-liner (see the sketch after this list)
- the philosophy behind the bash tools is that each one is designed to perform only one task -- but that task it performs really well!
- bash syntax can be annoying/confusing/frustrating, but you'll get the hang of it!
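As a quick, hedged sketch of such a pipe chain (the file name access.log is just a made-up placeholder): the one-liner below prints the ten most frequent lines in a file, with each small tool doing exactly one job and handing its output on to the next.

```bash
# Hypothetical example: list the ten most common lines in access.log.
# sort groups identical lines, uniq -c counts them, sort -rn orders by count,
# head keeps the top ten -- four tiny tools chained with the | operator.
sort access.log | uniq -c | sort -rn | head -n 10
```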
Some tools I constantly use
- some of the tools below I use in the script `tex2md.sh` that we discussed during the TechTalk -- its aim is to convert a table set in TeX into Markdown format.
- `grep` -- search for patterns in a file. Useful e.g. when debugging someone else's programme that throws an error for which you're trying to find the source code. I then typically run `grep 'error message' all/files/in/source/code`
- `sed` -- stream editor. Extremely useful if you want to modify the same string in a large bunch of files.
- `awk` -- actually a programming language. Extremely powerful to parse text files, do computations on the content of those files, modify the files' content and/or write the results to a new file -- plus much more!
- `let` -- a utility to create an integer counter if needed, e.g.: `b=3; let b=$b+1  # increments b by one`
- `cut` -- parse files column-wise
- `paste` -- merge files line by line (i.e. combine two files as columns)
- `dd` -- convert and copy files. I typically use it to cut large files into smaller ones
- `seq` -- creates a sequence of numbers. This list of numbers can increase/decrease by a user-defined step size, output can be padded with zeros, and more
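A few hedged one-liner sketches of how these tools might typically be invoked; the file names (notes.txt, *.txt) and the directory src/ are made-up placeholders, and the in-place `-i` flag below assumes GNU sed.

```bash
# grep: search recursively for a string and show file name plus line number
grep -rn 'error message' src/

# sed: replace 'old' with 'new' in a whole bunch of files at once (GNU sed)
sed -i 's/old/new/g' *.txt

# awk: sum the second whitespace-separated column of a file
awk '{sum += $2} END {print sum}' notes.txt

# cut + paste + seq: pull out one column and prepend a zero-padded line counter
seq -w 1 $(wc -l < notes.txt) | paste - <(cut -d' ' -f2 notes.txt)
```

The last line also uses process substitution (`<(...)`), which lets `paste` read the output of `cut` as if it were a file.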
Parallelisation
There is 'real' parallelisation with a tool called mpibash, but for most bash-related tasks it's overkill. I typically parallelise things that are 'embarrassingly parallel', i.e. I perform one and the same task over and over again. An example of this would be the conversion of a bunch of images from, say, png to jpg. The different tasks are completely independent of one another and, thus, easy to parallelise. The main idea behind this is the ability to 'send things to the background' with the `&` operator. Let's say you have an image called 'sun.png' and you'd like to use `convert` (comes with ImageMagick) to convert it to 'sun.jpg'. You could just run
`convert sun.png sun.jpg`
While the programme is running you will not be able to do anything else because the prompt is blocked. If you now append `&` to the command like so:
`convert sun.png sun.jpg &`
the programme will run in the background and you can launch the next job on image sunny.png:
`convert sun.png sun.jpg & convert sunny.png sunny.jpg &`
You'll now have two instances of `convert` running in parallel. You could play this game with 100 or 1000 images, but then the different instances will be competing for the CPU cycles of your, say, 4-core laptop. They'll all be running at the same time and, in the worst case, everything might crash because you don't have enough memory (depending on the size of the images, each instance might take up a lot of RAM). For that reason you might be better off having at most as many instances running as you have CPU cores on your machine. For that purpose we can define a little helper function that we shall call 'pwait':
```bash
pwait(){
    while [ $(jobs -p | wc -l) -ge $1 ]; do
        sleep 0.33
    done
}
```
This function counts the number of running jobs in the current shell with the inbuilt tools `jobs` and `wc`, waits for 0.33 seconds (with `sleep`) if the user-defined maximum number of jobs is still running, and then counts again. As soon as one of the jobs finishes, the while loop in `pwait` exits and the next iteration of the calling loop can start.
Thus we could run `convert` on all .png images in the current working directory like this:
```bash
for image in *.png; do
    convert $image ${image%png}jpg &
    pwait 4
done
wait
```
The last `wait` blocks the prompt until the very last job is done. The syntax `${image%png}jpg` removes the file ending 'png' and replaces it with 'jpg'.
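To make that a bit more concrete, here is a tiny hedged illustration of the `%` parameter expansion (the file name is made up):

```bash
image=sun.png
echo ${image%png}jpg    # prints sun.jpg -- '%png' strips the shortest trailing match of 'png'
echo ${image%.*}.jpg    # same result; '%.*' strips the extension including the dot
```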
In case you want to convert only a smaller subset (list) of the images, you can create
the list and run through it in the following manner:
```bash
list='image1.png image2.png image3.png image4.png'
for image in $list; do
    convert $image ${image%png}jpg &
    pwait 4
done
wait
```
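A hedged variant of the same loop that, instead of hard-coding 4, asks the machine for its core count; this assumes a system where `nproc` (GNU coreutils) is available.

```bash
ncores=$(nproc)     # number of available CPU cores (GNU coreutils)
for image in *.png; do
    convert $image ${image%png}jpg &
    pwait $ncores   # never run more convert instances than there are cores
done
wait
```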
Rsync and Cron jobs
- The slides I showed are in this rsync-cronjobs.pdf
- in a nutshell, `rsync` is a fancy way to copy data -- instead of copying everything it will synchronise source and sink by comparing time tags and/or sizes of files
- can be used like `scp` to sync different machines in a …