5 Must-know command-line commands for scientists
Why bother?
Why should you bother with such a nerdy tool? Well:
- You can do numerous things with LARGE files EXTREMELY fast and memory-efficiently
- Quickly do things on a cluster without downloading data first
- Such modifications can be automated, unlike opening the file in Excel and modifying it
- With wildcards (like * or ?) you can directly do the same thing to multiple files at once (see the small example right after this list)
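For instance, assuming your working directory contains several .csv files (just a hypothetical setup), a single command can show the header line of each of them:
# Print the first line (e.g., the column headers) of every .csv file at once
head -1 *.csv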
Here, I show you 5 commands for dealing with datasets that can really improve your efficiency. And of course, you can ask ChatGPT to help you with the exact commands, which makes it even easier. So you really just need to know about the general capabilities of the command line.
Let’s go.
Command 1: head / tail
Simply gets the first or last lines of a file.
head file.txt    # first 10 lines (the default)
head -5 file.txt # first 5 lines
tail file.txt    # last 10 lines (the default)
tail -f file.txt # follows while the file is being written to --> check progress while a program is writing to that file
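As a small illustration of how these combine, piping head into tail lets you grab an arbitrary slice of a file (a sketch, assuming file.txt has at least 15 lines; the | "pipe" operator appears again with grep below):
# Print lines 11-15: take the first 15 lines, then keep only the last 5 of those
head -15 file.txt | tail -5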
Command 2: awk
Extracts columns of data for you. You can also do computations with it.
awk -F "<DELIMITER>" '{ COMMAND }' file.txt
Example:
# get first column of a file (delimited by spaces)
awk -F " " '{print $1}' file.txt
# get first two columns, sum up next three
awk -F " " '{print $1,$2,$3+$4+$5}' file.txt
Command 3: sed
Can replace text, add text, or remove text from a file based on a search string.
- Syntax:
- s: substitute
- g: globally (replace every occurrence on a line, not just the first)
sed 's/<PATTERN>/<REPLACEMENT>/g' filename.txt
Example:
# Replace every mention of Year with YEAR in a file (note, this just writes the output to the terminal, it does not change the file)
sed 's/Year/YEAR/g' file.txt
# Same, but for all files starting with "myfile" and ending in ".txt", and with the replacement done in-place (-i modifies the files directly)!
sed -i 's/Year/YEAR/g' myfile*.txt
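sed can also delete lines or restrict a substitution to matching lines. Two hedged examples (assuming, purely for illustration, that file.txt has comment lines starting with "#" and some lines containing the word "Temperature"):
# Remove all lines starting with "#" (e.g., comment lines) and print the result
sed '/^#/d' file.txt
# Replace Year with YEAR only on lines that also contain "Temperature"
sed '/Temperature/ s/Year/YEAR/g' file.txt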
Command 4: grep
It helps you find text within files. You can combine it with other commands using the | ("pipe") operator, for instance, to count the number of lines containing a certain word using the command wc (word count).
# Find lines containing " 2010 " surrounded by spaces (note that grep 2010 file.txt also works, but would additionally match a line containing, e.g., 4.420104)
grep " 2010 " file.txt
# How many lines in the file contain 2010?
grep " 2010 " file.txt | wc -l
# Do the same for all files within directories starting with "run"
grep " 2010 " run*/file.txt | wc -l
# Show also lines around the found lines (-C stands for "context"):
grep -C 2 " 2010 " file.txt
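A few more grep options that often come in handy (the results/ directory is just a hypothetical example):
# Count matching lines directly, without piping into wc -l
grep -c " 2010 " file.txt
# Show all lines that do NOT contain " 2010 " (-v inverts the match)
grep -v " 2010 " file.txt
# Search recursively through all files below a directory
grep -r " 2010 " results/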
Command 5: md5sum
Can immediately tell you whether two files are exactly the same. It computes a “hash” of the file, basically just a weird-looking combination of letters and numbers (technically, it’s actually a number, but we don’t need to bother about this now). If there is the tiniest difference in a file, the command will produce a completely different hash.
Assume we have three files, of which two are actually identical. We can easily check this via:
md5sum file1.txt file2.txt file3.txt
0e64fe596b2fd856d50c2a5e42e3fe0d file1.txt
ccf28066f0b00c85be1f9a76358ae8f7 file2.txt
0e64fe596b2fd856d50c2a5e42e3fe0d file3.txt
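This is also handy for checking that files survived a transfer intact: write the hashes to a file, copy everything, and let md5sum verify the copies (checksums.md5 is just an arbitrary file name chosen here):
# Store the checksums, then verify the files against them later (e.g., after copying)
md5sum file1.txt file2.txt file3.txt > checksums.md5
md5sum -c checksums.md5
# Expected output if everything matches:
# file1.txt: OK
# file2.txt: OK
# file3.txt: OK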
One final thing: history / ctrl+r
The history command shows you all commands that you ran previously, so you can reuse them. Pressing CTRL+R on your keyboard also lets you search through the commands you have executed.
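For example, to dig an old sed command out of your history, you can combine it with grep from above:
# List previous commands, or filter them for a keyword
history
history | grep sed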
Summary
So, I hope these short examples have shown you how powerful the command line can be. Now that you know what is possible, you can simply ask ChatGPT "How do I replace all occurrences of the word 'hello' in a text file with 'hi' using the command line?". Hopefully, this makes your work more efficient :)