Motivation

You too can be a neckbeard.
I know many brilliant scientists and engineers who are quite adept at using tools like MATLAB, Python’s pandas library, or even Excel to analyze datasets and prototype algorithms. However, they are understandably unfamiliar with the powerful suite of command line tools available in Unix/Linux environments.
While it cannot replace full-fledged data analysis tools, the Unix/Linux command line toolset is an invaluable companion for quickly inspecting, cleaning, and transforming datasets. My hope is that, after enduring a small learning curve, you’ll come to appreciate these tools as much as I do. If you’re a Windows user (like most of us), you may eventually find yourself working exclusively within a Linux environment via the Windows Subsystem for Linux (WSL). Not only will you benefit from the tools described in this post, but your code libraries will likely run more reliably. Windows is a great desktop operating system, but the world of scientific computing largely revolves around Linux.
If you’re already a dirty neckbeard Unix/Linux user, you can probably skip this post. If you use a Mac, you already have access to a Unix-based terminal environment by default, so you can follow along just fine. Windows users will need to do a one-time installation of WSL. I recommend using the latest stable version of Ubuntu Server as your WSL distribution, which is 24.04 as of this writing.
Overview
In this post, I’ll cover the following topics:
- Intro to the Linux/Unix terminal.
- (Optional) WSL-specific tips and tricks for Windows users.
- Reference sheet containing the commands that I use most often, with examples.
There are many different ways of accomplishing the same task using command line tools. I’ll try to err towards simplicity and clarity, even if it means sacrificing some performance.
Intro to the Linux/Unix Terminal
I’m using WSL for all of the examples in this post. If you’re using a Mac or Linux machine, you can follow along just fine. I will use the words “terminal”, “command line”, and “shell” interchangeably to refer to the text-based interface for interacting with the operating system. You may also hear the term “bash”, which is a popular shell program.
After launching your terminal, you should see a prompt that looks something like this:
kevin@DESKTOP-G4PFJGJ:~$
The prompt typically contains your username (kevin), the machine name (DESKTOP-G4PFJGJ), and your current working directory (~, which is shorthand for your home directory).
Having all of these immediately visible is very helpful when you’re connected to remote machines, as it helps you keep track of which one you’re working on.
Let’s write the words “Hello, World!” to the terminal. You can do this using the echo command:
kevin@DESKTOP-G4PFJGJ:~$ echo "Hello, World!"
Hello, World!
kevin@DESKTOP-G4PFJGJ:~$
Now let’s use the > operator to redirect the output of the echo command to a text file called hello.txt:
kevin@DESKTOP-G4PFJGJ:~$ echo "Hello, World!" > ./hello.txt
kevin@DESKTOP-G4PFJGJ:~$
You may be wondering why I used ./hello.txt instead of just hello.txt. The period (.) means “current directory”, so ./hello.txt explicitly specifies that the file
should be created in the current working directory. This is a good habit to get into, as it helps with tabbed autocompletion in the terminal. More on this in a second.
If we display the contents of the hello.txt file using the cat command, we can see that it contains the expected text:
kevin@DESKTOP-G4PFJGJ:~$ cat ./hello.txt
Hello, World!
kevin@DESKTOP-G4PFJGJ:~$
When I typed the above command, I used tabbed autocompletion to save myself some typing. After typing cat ./h, I pressed the Tab key, and the terminal automatically completed
the filename to ./hello.txt. This was a bit unnecessary in this toy example, but it can be a huge time-saver when working with long filenames or deeply nested directories.
If I repeat the same commands, but this time with a different message, the contents of the hello.txt file will be overwritten:
kevin@DESKTOP-G4PFJGJ:~$ echo "Goodbye, World!" > ./hello.txt
kevin@DESKTOP-G4PFJGJ:~$ cat ./hello.txt
Goodbye, World!
kevin@DESKTOP-G4PFJGJ:~$
To append text to the end of the file instead of overwriting it, we can use the >> operator. Let’s use this to create a simple CSV file:
kevin@DESKTOP-G4PFJGJ:~$ echo "Name,Age,Occupation" > ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ echo "Alice,30,Engineer" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ echo "Bob,25,Data Scientist" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv
Name,Age,Occupation
Alice,30,Engineer
Bob,25,Data Scientist
kevin@DESKTOP-G4PFJGJ:~$
This brings us to the pipeline operator, the | character (located above the backslash key on most keyboards). The pipeline operator allows us to chain commands together, using the output of one command as the input to the next. For example, we can use the cat command to read the contents of the people.csv file, then pipe it to the wc (word count) command to count the number of lines in the file:
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv | wc -l
3
kevin@DESKTOP-G4PFJGJ:~$
The -l flag tells the wc command to count only lines; without it, wc reports counts of lines, words, and bytes. We actually could have used the wc command directly on the file without using cat, like so:
kevin@DESKTOP-G4PFJGJ:~$ wc -l ./people.csv
3 ./people.csv
kevin@DESKTOP-G4PFJGJ:~$
But this example illustrates how the pipeline operator works. If you don’t quite recall the options for a command, you can usually access its manual page using the man command.
For example, to view the manual page for the wc command, you would run man wc.
The Unix Philosophy
A key concept behind the Unix/Linux command line is the “Unix Philosophy”, which emphasizes building small, modular tools that do only one thing well. By combining these tools using pipelines, users can accomplish complex tasks in a flexible and efficient manner. By adopting this philosophy in your own data analysis workflows, you can create powerful and reusable data processing pipelines that quickly adapt to changing requirements.
Let's add a couple more people to our CSV file:
kevin@DESKTOP-G4PFJGJ:~$ echo "Charlie,35,Teacher" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ echo "Diana,28,Engineer" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ echo "Fred,56,engineer" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv
Name,Age,Occupation
Alice,30,Engineer
Bob,25,Data Scientist
Charlie,35,Teacher
Diana,28,Engineer
Fred,56,engineer
kevin@DESKTOP-G4PFJGJ:~$
Now we’ll use the grep command to filter the rows based on a search term. For example, we can find all rows where the occupation is “Engineer”:
kevin@DESKTOP-G4PFJGJ:~$ echo "Charlie,35,Teacher" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ echo "Diana,28,Engineer" >> ./people.csv
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv | grep "Engineer"
Alice,30,Engineer
Diana,28,Engineer
kevin@DESKTOP-G4PFJGJ:~$
Poor Fred didn’t get matched because his occupation is listed as “engineer” (lowercase “e”). The grep command is case-sensitive by default. To perform a case-insensitive search, we can use the -i flag:
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv | grep -i "engineer"
Alice,30,Engineer
Diana,28,Engineer
Fred,56,engineer
kevin@DESKTOP-G4PFJGJ:~$
We can also use the -v flag to invert the search, returning all lines that do not match the search term. Let’s use this to find all rows where the occupation is not “Engineer”:
kevin@DESKTOP-G4PFJGJ:~$ cat ./people.csv | grep -iv "engineer"
Name,Age,Occupation
Bob,25,Data Scientist
Charlie,35,Teacher
kevin@DESKTOP-G4PFJGJ:~$
Like with the wc command, we could have used grep directly on the file without using cat, like so: grep -iv "engineer" ./people.csv.
Let’s add in the cut command, which allows us to extract specific columns from a delimited text file. By default, cut uses the tab character as the delimiter, but we can specify a different delimiter using the -d flag. We’ll use it to extract just the names of the engineers from our CSV file. While we’re at it, let’s save these names to a new file called layoff-list.txt:
kevin@DESKTOP-G4PFJGJ:~$ grep -i "engineer" ./people.csv | cut -d "," -f 1 > layoff-list.txt
kevin@DESKTOP-G4PFJGJ:~$ cat layoff-list.txt
Alice
Diana
Fred
kevin@DESKTOP-G4PFJGJ:~$
With cut, you can specify multiple fields to extract using a comma-separated list (e.g., -f 1,3 to extract fields 1 and 3), or a range of fields using a hyphen (e.g., -f 2-3 to extract fields 2 through 3),
or a combination of both (e.g., -f 1,3-4 to extract field 1 and fields 3 through 4). Please see the cut manual page for more details.
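For example, using our people.csv file from above:
kevin@DESKTOP-G4PFJGJ:~$ cut -d "," -f 1,3 ./people.csv
Name,Occupation
Alice,Engineer
Bob,Data Scientist
Charlie,Teacher
Diana,Engineer
Fred,engineer
kevin@DESKTOP-G4PFJGJ:~$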
In practice, you may find yourself combining several commands together in a single pipeline to accomplish more complex tasks. If you’re like me, you may have a hard time remembering some of the commands you have used in the past. Luckily, the command line has a built-in history feature that allows you to recall and reuse previously executed commands. You can use the up and down arrow keys to navigate through your command history. Pressing the up arrow key will display the previous command, while pressing the down arrow key will show the next command. I often use the “!!” shortcut to quickly re-execute the last command I ran. The benefit of using this versus the up arrow key is that it won’t clutter up your command history with duplicate entries.
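For example (note that bash prints the expanded command before running it):
kevin@DESKTOP-G4PFJGJ:~$ echo "Hello again"
Hello again
kevin@DESKTOP-G4PFJGJ:~$ !!
echo "Hello again"
Hello again
kevin@DESKTOP-G4PFJGJ:~$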
You can also use the history command to view a list of the last thousand or so previously executed commands.
kevin@DESKTOP-G4PFJGJ:~$ history
1 echo "Hello, World!"
2 cat ./hello.txt
3 echo "Goodbye, World!" > ./hello.txt
[snipped for brevity]
By combining the history command with the grep command, you can search for a specific command you have used in the past. For example:
kevin@DESKTOP-G4PFJGJ:~$ history | grep cut
42 grep -i "engineer" ./people.csv | cut -d "," -f 1 > layoff-list.txt
kevin@DESKTOP-G4PFJGJ:~$
You might notice that I omitted the quotes from the search term when using grep above. This works fine for simple search terms without spaces or special characters, but it’s generally a good idea to include the quotes to avoid unexpected behavior.
By using the exclamation mark (!) followed by the command number from the history list, you can re-execute a specific command. For example, to re-execute command number 42 from the history list, you would run:
kevin@DESKTOP-G4PFJGJ:~$ !42
grep -i "engineer" ./people.csv | cut -d "," -f 1 > layoff-list.txt
kevin@DESKTOP-G4PFJGJ:~$
As before, bash echoes the expanded command before running it; because command 42 redirects its output to a file, nothing else appears on the terminal.
Recap:
- The > operator redirects output to a file, overwriting its contents.
- The >> operator appends output to the end of a file.
- The | operator (pipeline) allows you to chain commands together.
- The !! shortcut re-executes the last command without cluttering the history with duplicates.
- The ! followed by a command number re-executes that specific command from history.
- Up and down arrow keys navigate command history.
- Tabbed completion saves typing and reduces errors.
(Optional) WSL Tips and Tricks for Windows Users
If you’re using WSL on a Windows machine, here are a few tips and tricks to help you get the most out of your experience. First, WSL mounts your Windows drives under the /mnt/ directory. For example, your C: drive is accessible at /mnt/c/. If you have a text file in C:\tmp\data.txt, you can access it from WSL at /mnt/c/tmp/data.txt:
kevin@DESKTOP-G4PFJGJ:~$ cat /mnt/c/tmp/data.txt
[contents of data.txt]
kevin@DESKTOP-G4PFJGJ:~$
However, you should think of WSL as a separate machine inside your Windows PC. The two communicate over a comparatively slow internal network connection, which slows cross-filesystem access down by more than an order of magnitude. For example, running the wc command on a large text file located on the Windows file system might take several minutes:
kevin@DESKTOP-G4PFJGJ:~$ time wc -l /mnt/c/tmp/largefile.txt
10000000 /mnt/c/tmp/largefile.txt
real 5m45.123s
user 0m0.456s
sys 0m0.789s
For one-off tasks this may be acceptable, but if you plan to work with large files frequently, it’s best to copy or move them into the WSL filesystem for better performance:
kevin@DESKTOP-G4PFJGJ:~$ mv /mnt/c/tmp/largefile.txt ~/
kevin@DESKTOP-G4PFJGJ:~$ time wc -l ~/largefile.txt
10000000 /home/kevin/largefile.txt
real 0m3.456s
user 0m0.123s
sys 0m0.234s
When working with Git-backed projects, it’s generally best to clone the repositories directly into the WSL filesystem, rather than working with them on the Windows filesystem. The performance difference will be single-digit seconds versus multiple minutes for large repositories.
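For example (the repository URL here is just a placeholder):
# Fast: clone into the WSL filesystem
git clone https://github.com/example/repo.git ~/repo
# Slow: clone onto the Windows filesystem via /mnt/c
# git clone https://github.com/example/repo.git /mnt/c/repos/repo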
Sometimes you may prefer to work with files using the Windows File Explorer interface. Running the explorer.exe command from within WSL will open a Windows File Explorer window at the
specified path. For example, to open the current directory in File Explorer, you would run explorer.exe .. This can be a handy way to drag-and-drop files between Windows and WSL.
You can also launch Windows applications directly from the WSL terminal. For example, to open Notepad with a specific file, you would run notepad.exe ./data.txt.
If you are a developer, you can launch Visual Studio Code from within WSL using the code command. This will open VS Code with the current directory as the workspace, i.e. code ..
Then you can use the integrated terminal within VS Code to run WSL commands directly.
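To recap the interop commands:
# Open the current directory in Windows File Explorer
explorer.exe .
# Open a file in Notepad
notepad.exe ./data.txt
# Open VS Code with the current directory as the workspace
code .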
Important Note About WSL Disk Usage
WSL stores all of its data in a file with a “virtual hard disk” (.vhdx) extension. The location is distribution-dependent. While WSL will automatically grow this file as needed, it will not automatically shrink it when files are deleted within WSL. As I do the majority of my work within WSL, I occasionally find that my WSL virtual hard disk has grown to tens of gigabytes in size, even though I’ve deleted large files within WSL. For this reason it’s generally best not to store large files in WSL long term, though I sometimes forget to clean up.
There are several ways to reclaim disk space if necessary. One common way is to compact the .vhdx file using the diskpart command. Another is to export your WSL distribution to a tarball (assuming you have enough free disk space to store it), unregister the distribution (which deletes the .vhdx file), then re-import it from the tarball. (This is also a handy way to migrate a WSL distribution to a new PC.) Once re-imported, the tarball can be deleted.
For example, from a Windows PowerShell prompt, you might run:
wsl --shutdown
wsl --export <DistroName> <TempFileName>.tar
wsl --import <DistroName> <InstallLocation> <TempFileName>.tar --version 2
del <TempFileName>.tar
Commonly Used Commands
These are the operators and commands that I use most often for data wrangling:
| Operator | Description |
|---|---|
| > | Redirect output to a file, overwriting its contents. |
| >> | Append output to the end of a file. |
| \| | Pipeline operator, allows chaining commands together. |
| & | Run a command in the background. Useful for parallelizing long-running tasks. |
| Command | Description |
|---|---|
| awk | A powerful text processing tool for pattern scanning and processing. Think grep on steroids. |
| bg | Resume a suspended job in the background. |
| cat | Display the contents of files. Can also be used to concatenate multiple files. |
| cd | Change the current working directory. |
| cp | Copy files and directories. |
| cut | Extract specific columns from a delimited text file. |
| disown | Remove a job from the shell’s job table, so it will continue to run after you log off a server. |
| find | Search for files and directories based on various criteria, and perform actions on the results. |
| for | Loop construct for iterating over a list of items. |
| grep | Search for patterns within text files. |
| gzip | Compress files using the gzip compression algorithm. |
| gunzip | Decompress files compressed with gzip. |
| head | Display the first few lines of a text file. See also tail for the last few lines. |
| history | Display the command history. |
| less | View the contents of a file one page at a time. |
| ls | List the contents of a directory. |
| mv | Move or rename files and directories. |
| rm | Remove files or directories. |
| shuf | Shuffle lines of text files randomly. |
| sort | Sort lines of text files. |
| split | Split a large text file into smaller files. |
| tail | Display the last few lines of a text file. See also head for the first few lines. |
| tar | Archive multiple files into a single file, optionally with compression. |
| top | Display real-time system resource usage and running processes. |
| uniq | Remove duplicate lines and count occurrences. **Caution**: only works on adjacent duplicate lines; use in combination with sort for best results. |
| unzip | Decompress files compressed with the zip algorithm. |
| watch | Run a command periodically, displaying the output in the terminal. |
| wc | Count lines, words, and characters in text files. |
| zip | Compress files using the zip compression algorithm. Prefer tar and gzip instead. |
Detailed Examples
awk
awk is like grep on steroids, and is a programming language in its own right. As a simple example, let’s say we have a CSV file called data.csv with the following contents:
Name,Age,Occupation
Alice,30,Engineer
Bob,25,Data Scientist
Charlie,35,Teacher
Diana,28,Engineer
Fred,56,engineer
We can use awk to extract the names of all engineers aged 30 and over:
awk -F, '$3 ~ /[Ee]ngineer/ && $2 >= 30 { print $1 }' ./data.csv
Alice
Fred
Although awk is quite powerful, I generally prefer to have a proper code debugger when writing more complex data processing scripts. My most common use of awk is combining multiple CSV files into a single file while removing duplicate header rows. For example:
awk 'FNR==1 && NR!=1 { next } { print }' *.csv > combined.csv
Please see Vijay’s Stack Overflow answer for more details on how this works.
bg
Switch a process from the foreground to the background. I almost always use it in combination with the disown command.
cat
# Combine the contents of file1.txt and file2.txt into a new file called combined.txt
cat file1.txt file2.txt > combined.txt
# Combine the contents of all text files in the current directory into a new file called combined.txt
cat *.txt > combined.txt
# Display the contents of combined.txt
cat combined.txt
cd
Change directory.
# Change to the /mnt/c/tmp directory
cd /mnt/c/tmp
# Change to the home directory
cd ~
cp
Copies files and directories.
# Copy file1.txt from C:\temp to home directory in WSL
cp /mnt/c/temp/file1.txt ~/
# Copy all text files from C:\temp to home directory in WSL
cp /mnt/c/temp/*.txt ~/
# Copy a directory and its contents recursively
cp -r /mnt/c/temp/myfolder ~/myfolder_backup
cut
Selects specific columns from a delimited text file.
# Imagine we have a CSV file called people.csv with the following contents:
# Name,Age,Occupation
# Alice,30,Engineer
# Bob,25,Data Scientist
# Charlie,35,Teacher
# Diana,28,Engineer
# Extract the first column (Name) from people.csv
$ cut -d "," -f 1 ./people.csv
Name
Alice
Bob
Charlie
Diana
# Extract the second and third columns (Age and Occupation) from people.csv
$ cut -d "," -f 2-3 ./people.csv
Age,Occupation
30,Engineer
25,Data Scientist
35,Teacher
28,Engineer
# Extract the first and third columns
$ cut -d "," -f 1,3 ./people.csv
Name,Occupation
Alice,Engineer
Bob,Data Scientist
Charlie,Teacher
Diana,Engineer
disown
This is useful if you are running a long-running job on a remote server via SSH, and you want the job to continue running after you log off.
# Start a long-running process that requires some user input in the beginning
process_database_records.exe --username kevin
Enter password: ********
# (Press Ctrl+Z to suspend the process)
# Resume the process in the background. This will print a list of background jobs.
bg
# Remove the job from the shell's job table, so it will continue to run after you log off.
# Here, %1 refers to the job number. Use 'jobs' command to see job numbers. It is also possible to use %%, which refers to the most recent job.
disown -h %1
# Run the 'top' command to verify that the process is still running.
top
# Log off and go home for the night. The process will continue running on the server.
exit
find
This is one of the most useful commands in the shell. It allows you to search for files and directories based on various criteria, and perform actions on the results.
# List all files and directories in the current directory and its subdirectories
find .
# Find all .txt files in the /mnt/c/tmp directory and its subdirectories
find /mnt/c/tmp -name "*.txt"
# Find all .log files larger than 10MB in the /var/log directory and delete them
find /var/log -name "*.log" -size +10M -exec rm {} \;
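The output of find also works well in pipelines. For example, to count the matching files:
# Count the number of .csv files under the current directory and its subdirectories
find . -name "*.csv" | wc -l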
for
The for command is a loop construct for iterating over a list of items. For example, you can use it to process multiple files in a directory using a Python script:
# Loop over all CSV files in a given directory and its subdirectories, and pass each one to a Python script for processing.
shopt -s globstar # Enable recursive ** globbing in bash
for file in path/to/inputdir/**/*.csv; do
    python3 process_data.py --input-file "$file" --output-file /path/to/output/dir/"$(basename "$file" .csv)_processed.csv"
done
Imagine you have a text file containing a list of URLs (one per line), and you want to download each URL containing “kevinskii.kev” using the wget command and pipe the downloaded contents into a processing script:
for url in $(cat urls.txt | grep "kevinskii.kev"); do
    wget -qO- "$url" | python3 ./process_html.py # -qO- writes the downloaded page to stdout
done
grep
Filters lines based on a search term.
# Imagine we have a CSV file called people.csv with the following contents:
# Name,Age,Occupation
# Alice,30,Engineer
# Bob,25,Data Scientist
# Charlie,35,Teacher
# Diana,28,Engineer
# Fred,56,engineer
# Find all rows where the occupation is "Engineer"
$ grep "Engineer" ./people.csv
Alice,30,Engineer
Diana,28,Engineer
# Perform a case-insensitive search for "engineer"
$ grep -i "engineer" ./people.csv
Alice,30,Engineer
Diana,28,Engineer
Fred,56,engineer
# Find all rows where the occupation is not "Engineer"
$ grep -iv "engineer" ./people.csv
Name,Age,Occupation
Bob,25,Data Scientist
Charlie,35,Teacher
# Get the ages of the above
$ grep -iv "engineer" ./people.csv | cut -d "," -f 2
Age
25
35
gzip
Compresses a file using the gzip compression algorithm. Automatically appends a .gz extension to the filename. For a single file, gzip is generally preferred over zip, as it is a simpler and more efficient format.
# Compress a file called data.txt
gzip data.txt
gunzip
Decompresses a file compressed with gzip.
# Decompress a file called data.txt.gz
gunzip data.txt.gz
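If you just want to peek inside a compressed file without decompressing it on disk, zcat (or equivalently gunzip -c) streams the contents to stdout, which plays nicely with pipelines:
# View the first few lines of a gzipped file
zcat data.txt.gz | head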
head
Prints the first few lines of a text file. By default, it prints the first 10 lines, but you can specify a different number using the -n flag.
# Display the first 10 lines of a file called data.txt
head ./data.txt
# Display the first 5 lines of a file called data.txt
head -n 5 ./data.txt
This command is handy for creating a random sample of a large text file. For example, to create a sample file containing the first 1000 lines of a large CSV file called data.csv, you would run:
head -n1 ./data.csv > ./sample_data.csv # Copy the header row
tail -n +2 ./data.csv | sort -R | head -n1000 >> ./sample_data.csv # Randomly shuffle the data rows and take 1000 of them
Here, tail -n +2 outputs everything from the second line onward (skipping the header), sort -R randomly shuffles the lines, and head -n1000 takes the first 1000 lines from the shuffled output.
history
Displays your command history, which is stored in the file ~/.bash_history. Often useful in combination with the grep command to search for a specific command you have used in the past.
# Search for all commands that contain the word "dotnet"
history | grep dotnet
42 dotnet build MyProject.csproj
99 dotnet run --project MyProject.csproj
# Rerun command number 42 from the history list
!42
less
Views the contents of a file one page at a time, without loading the entire file into memory, as a program like Notepad.exe would have to do. This makes it ideal for browsing huge text files without overwhelming the terminal.
When using less, the spacebar advances to the next page, the b key goes back one page, and the q key quits less and returns to the terminal prompt. You can also search within the file by typing / followed by a pattern.
less ./some_huge_text_file.txt
ls
Lists the contents of a directory.
# List all files and directories in the current directory
ls .
# List all files and directories in long format, including hidden files
ls -la .
# List all files and directories in a specific directory, with human-readable file sizes
ls -lh /mnt/c/tmp
mv
Moves or renames files and directories.
# Move file1.txt from C:\temp to home directory in WSL
mv /mnt/c/temp/file1.txt ~/
# Rename file1.txt to file2.txt in the home directory
mv ~/file1.txt ~/file2.txt
rm
Removes files or directories. Be very careful when using this command, as it permanently deletes files without moving them to a recycle bin.
# Remove a file called file1.txt
rm file1.txt
# Remove a directory and its contents recursively, without prompting for confirmation
rm -rf /path/to/folder
# Remove a directory's contents recursively, without deleting the directory itself (note the trailing slash)
rm -rf /path/to/folder/
shuf
Shuffles lines of text files randomly.
# Shuffle the lines of a file called data.txt and display the result
shuf data.txt
# Shuffle the lines of a file called data.txt and save the result to a new file called shuffled_data.txt
shuf data.txt > shuffled_data.txt
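With the -n flag, shuf can also draw a random sample directly, which is a simpler alternative to the sort -R trick shown under head:
# Take a random sample of 1000 lines from data.txt
shuf -n 1000 data.txt > sample.txt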
sort
Puts the lines of a text file in sorted order. Can also be used to sort numerically, reverse order, and in random order, and remove duplicate lines.
# Sort the lines of a file called data.txt in ascending order and write them to sorted_data.txt
sort ./data.txt > ./sorted_data.txt
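A few of the other modes mentioned above (the filenames are illustrative):
# Sort numerically rather than lexicographically
sort -n ./numbers.txt
# Sort in reverse (descending) order
sort -r ./data.txt
# Shuffle the lines randomly
sort -R ./data.txt
# Sort and remove duplicate lines
sort -u ./data.txt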
Imagine we have a CSV file called vehicles.csv with the following contents:
Model,VIN
Ford,12345
Toyota,67890
Honda,54321
Ford,12345
...
We can use the sort command, in conjunction with the uniq command, to count the number of occurrences of each unique vehicle model and display them in descending order:
cut -d "," -f 1 ./vehicles.csv | sort | uniq -c | sort -nr
This would output something like:
150 Ford
120 Toyota
90 Honda
...
If we want to simply count the number of unique vehicle models without displaying the counts, we can use:
cut -d "," -f 1 ./vehicles.csv | uniq | sort -u | wc -l
In this case, the -u flag tells sort to output only unique lines. To do this, it must first sort the lines, so that duplicate lines are adjacent. For huge files, this can be quite slow and memory-intensive, so we take some pressure off the system by using uniq first, which only requires a single pass through the file and removes adjacent duplicate lines. Then sort -u can operate on a much smaller set of data.
split
Splits a large text file into smaller files. This can be handy for parallelizing data processing tasks, or for breaking up large files into more manageable chunks prior to transferring them over a network. That way, if the transfer is interrupted, you only need to re-transfer a small chunk instead of the entire file.
# Split a large file called data.txt into smaller files with 1000 lines each
split -l 1000 data.txt part_
# Split a large file called data.txt into 8 equal parts (use -n l/8 instead if splitting mid-line would break your data)
split -n 8 data.txt part_
# Launch a background process for each part, so that we are processing them in parallel
for file in part_*; do
python3 process_data.py --input-file "$file" --output-file "${file}_processed.txt" &
done
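After transferring the chunks, reassembling them is a one-liner, since the part_ prefix sorts in the correct order:
# Reassemble the chunks into a single file
cat part_* > data_reassembled.txt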
tail
Like head, but prints the last few lines of a text file. By default, it prints the last 10 lines, but you can specify a different number using the -n flag.
# Display the last 10 lines of a file called data.txt
tail ./data.txt
The command also has the -f flag, which allows you to “follow” a file as it is being written to. This is especially useful for monitoring log files in real-time.
# Launch a process in the background that writes to a log file
my_server_process.exe > server.log &
# Follow the log file in real-time
tail -f server.log
# Press Ctrl+C to stop following the log file and do some other work...
# Come back later and follow the log file again
tail -f server.log
tar
Combines several files into a single archive file, while retaining their directory structure. For large collections of files, tar is generally preferred over zip, as it is a simpler and more efficient format. For example, we can use tar to archive and compress a directory called data_folder into a single file called data_folder.tar.gz:
tar -czvf ./data_folder.tar.gz data_folder/
To extract the contents of a tar.gz file, we can use the following command:
tar -xzvf ./data_folder.tar.gz
If you are archiving data that is already compressed, e.g. JPEG or PNG images or zip files, you can omit the -z flag to avoid redundant compression:
tar -cvf ./data_folder.tar data_folder/
top
Displays real-time system resource usage and running processes. Useful for monitoring CPU and memory usage, as well as identifying resource-intensive processes. Also handy for verifying that a long-running process is still active.
top
uniq
Given several lines of text, removes duplicate lines (meaning lines that are identical to their immediate predecessor). Can also be used with the -c flag to count occurrences of each unique line. See the sort command for an example of how to use uniq in practice.
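A quick demonstration of the adjacency caveat, using printf to generate three lines:
# Only adjacent duplicates are removed: prints a, b, a
printf "a\nb\na\n" | uniq
# Sorting first makes all duplicates adjacent: prints a, b
printf "a\nb\na\n" | sort | uniq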
watch
Runs a command periodically, displaying the output in the terminal. Useful for monitoring changes in system status or file contents over time. I most often use it with the nvidia-smi command to monitor GPU usage while training machine learning models.
# Show GPU usage every 3 seconds
watch -n3 nvidia-smi
wc
Counts lines, words, and characters in text files. I never use it with anything except the -l flag, to count lines.
wc -l ./data.txt