Street Fighting Data Analysis with WSL - Part 1
Overview
Unix-like operating systems (Linux, Unix, macOS) include a common set of powerful command line utilities that can make quick work of basic data analysis and data wrangling tasks. These utilities are especially useful when working with datasets that are several gigabytes in size: they are too large to be analyzed easily with tools like Excel, MATLAB, or Python’s pandas, yet aren’t so large that they require a full-blown distributed computing framework like Apache Spark.
This is the first in a planned series of posts that show examples of my go-to commands for quickly inspecting, cleaning, and transforming datasets. In this post we’ll focus on basic file system operations such as moving, copying, searching for, unzipping, and counting lines in files.
Since most biomedical researchers seem to use Windows machines (like me), I’ll be demonstrating these commands using the Windows Subsystem for Linux (WSL). MacOS and Linux users should be able to follow along just fine, as the commands are nearly identical.
Note to Friendly IT System Administrators

Yes yes, I know that there are more efficient ways to do many of the things I’m about to describe. This post is primarily intended for biomedical researchers who aren’t concerned about best practices, but rather just want to get things done quickly and with minimal fuss.
I also acknowledge you have forgotten more about Linux than I could ever hope to know.
Quick Reference
This post will cover the following commands:
| Command | Description | Example |
|---|---|---|
| cd | Change directory | cd /path/to/directory/ |
| cp | Copy files or directories | cp /path/to/source /path/to/destination |
| find | Search for files and perform actions on them | find /path/to/directory/ -name "*.txt" -exec cat {} \; |
| ls | List directory contents | ls /path/to/directory/ |
| mv | Move (or rename) files or directories | mv /path/to/source /path/to/destination |
| wc | Word count (lines, words, characters) | wc -l /path/to/file |
Prerequisites
Installing WSL
If you’re using Windows 10 or later, you can install WSL, a fully contained Linux environment, by following the official Microsoft guide here. I recommend installing the latest stable version of Ubuntu Server as your WSL distribution, which is version 24.04 as of this writing.
Downloading the MAUDE dataset
For demonstration purposes, I’ll be using the publicly available MAUDE dataset from the U.S. Food and Drug Administration (FDA). Using a web browser, I’ve manually downloaded the text narrative files for years 2020-2024 to a temporary folder on my Windows machine.

These files contain complaint narratives for medical device defects reported to the FDA.
Basic File Operations
ls: Listing Directory Contents
I saved the downloaded zip files to D:\tmp\maude. WSL mounts Windows drives under the /mnt/ directory, so I can access my files from within WSL at /mnt/d/tmp/maude/.
To list the directory contents, we can use the ls command:
theodora@DESKTOP-G4PFJGJ:~$ ls /mnt/d/tmp/maude/
foitext2020.zip foitext2021.zip foitext2022.zip foitext2023.zip foitext2024.zip
theodora@DESKTOP-G4PFJGJ:~$
Note a couple of things:
- The prompt shows my WSL username (theodora) and the machine name (DESKTOP-G4PFJGJ).
- Upon running the ls /mnt/d/tmp/maude/ command, the foitext zip files are listed on the following line.
- The tilde (~) indicates my current working directory, which is my WSL home directory.
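A few ls flags are worth knowing once you start dealing with large files. Here is a minimal sketch on a scratch directory; the file names are hypothetical placeholders rather than the real MAUDE files:

```shell
# Create a scratch directory with two placeholder files (hypothetical names).
demo_dir=$(mktemp -d)
printf 'a\n' > "$demo_dir/small.txt"
printf 'a\nb\nc\n' > "$demo_dir/large.txt"

# -l: long format (permissions, owner, size, date); -h: human-readable sizes
ls -lh "$demo_dir"

# -S: sort by size, largest first; -1: one name per line
largest=$(ls -1S "$demo_dir" | head -n 1)
echo "Largest file: $largest"
```

The same flags work on any path, including the /mnt/d/... paths used in this post.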
find: Searching for files, and doing things with them
Alternatively, we can use the find command to search for files in a directory:
theodora@DESKTOP-G4PFJGJ:~$ find /mnt/d/tmp/maude/
/mnt/d/tmp/maude/
/mnt/d/tmp/maude/foitext2020.zip
/mnt/d/tmp/maude/foitext2021.zip
/mnt/d/tmp/maude/foitext2022.zip
/mnt/d/tmp/maude/foitext2023.zip
/mnt/d/tmp/maude/foitext2024.zip
Unlike ls, the find command recursively lists all files and directories under the specified path. But that’s not all!
The find command is extremely powerful because it allows you to perform actions on the files it finds. I use it almost daily.
Exploring all of its capabilities would warrant a blog post of its own.
Let’s use it to unzip all of the MAUDE zip files into a new directory:
theodora@DESKTOP-G4PFJGJ:~$ mkdir /mnt/d/tmp/maude/unzipped
theodora@DESKTOP-G4PFJGJ:~$ find /mnt/d/tmp/maude/ -name "*.zip" -exec unzip {} -d /mnt/d/tmp/maude/unzipped/ \;
Note a few things:
- We used the mkdir command to create a new subdirectory called unzipped.
- The find command searches for all files with names matching the pattern *.zip under the specified path.
- The -exec flag allows us to execute a command on each file found. Here, we use the unzip command to extract each zip file ({} is a placeholder for the current file found) into the unzipped directory.
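To make the -exec pattern a bit more concrete, here is a small sketch that can be run without the MAUDE files. The directory layout and file names are made up for illustration, and wc and cat stand in for unzip:

```shell
# Scratch directory with placeholder files (hypothetical names).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
printf 'one\ntwo\n' > "$demo_dir/a.txt"
printf 'three\n' > "$demo_dir/sub/b.txt"
printf 'ignored\n' > "$demo_dir/notes.md"

# -type f restricts matches to regular files; -name filters by pattern.
# Ending with "\;" runs the command once per file found.
find "$demo_dir" -type f -name "*.txt" -exec wc -l {} \;

# Ending with "+" passes all matches to a single invocation instead,
# which is usually faster when there are many files.
total_lines=$(find "$demo_dir" -type f -name "*.txt" -exec cat {} + | wc -l)
echo "Total lines in .txt files: $total_lines"
```

Note that find descends into sub/ on its own, and that notes.md is skipped because it does not match the pattern.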
Now if we look inside the unzipped directory, we can see all of the extracted text files:
theodora@DESKTOP-G4PFJGJ:~$ ls /mnt/d/tmp/maude/unzipped/
foitext2020.txt foitext2021.txt foitext2022.txt foitext2023.txt foitext2024.txt

wc: Word Count (and line count)
The wc (word count) command is a simple yet useful utility for counting lines, words, and characters in text files. I always use it to count lines, via the “-l” flag.
theodora@DESKTOP-G4PFJGJ:~$ wc -l /mnt/d/tmp/maude/unzipped/*.txt
4020739 /mnt/d/tmp/maude/unzipped/foitext2020.txt
4872212 /mnt/d/tmp/maude/unzipped/foitext2021.txt
6640118 /mnt/d/tmp/maude/unzipped/foitext2022.txt
5439271 /mnt/d/tmp/maude/unzipped/foitext2023.txt
6083071 /mnt/d/tmp/maude/unzipped/foitext2024.txt
27055411 total
There are over 27 million lines across all five text files, and they total over 10 gigabytes in size! Loading these files into memory using a typical data analysis tool (e.g., pandas in Python, or data.frame in R) would be impractical on most personal computers.
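If you want to check sizes like these yourself, wc -c counts bytes and du -h reports disk usage in human-readable units. A minimal sketch on a tiny placeholder file (not the real data):

```shell
# Small placeholder file standing in for a dataset.
demo_file=$(mktemp)
printf 'line one\nline two\n' > "$demo_file"

# Reading from stdin (<) makes wc print only the number, without the filename.
line_count=$(wc -l < "$demo_file")   # number of newline characters
byte_count=$(wc -c < "$demo_file")   # number of bytes
echo "lines=$line_count bytes=$byte_count"

# du -h prints disk usage in human-readable units (K, M, G).
du -h "$demo_file"
```

On the real files, something like du -ch *.txt would also print a grand total on the last line.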
If you are actually following along and trying these commands yourself (of course you’re not; nor would I), you might have noticed that the above command took several minutes to run. This is because the files are located on the Windows filesystem, not inside the WSL filesystem, which slows access down by more than an order of magnitude. Think of WSL as a separate PC that has to communicate with your Windows machine over a slow network connection.
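If you’d like to measure this on your own machine rather than take my word for it, one rough approach is to record the clock before and after a command. This is only a sketch, using a generated placeholder file instead of the MAUDE data:

```shell
# Placeholder file standing in for a large dataset (generated content).
demo_file=$(mktemp)
seq 1 100000 > "$demo_file"

# Record wall-clock seconds before and after the command.
start=$(date +%s)
line_count=$(wc -l < "$demo_file")
end=$(date +%s)

echo "Counted $line_count lines in $((end - start)) second(s)"
```

In bash you can also simply prefix a command with time (for example, time wc -l somefile.txt) to get sub-second resolution. Running the same command against a copy under /mnt/ and a copy in your home directory shows the gap.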
mv and cp: Moving and Copying Files
To speed things up, let’s move the zip files into the WSL filesystem, into a maude directory in my home folder. Then we’ll unzip them there and run wc again:
theodora@DESKTOP-G4PFJGJ:~$ mkdir -p ~/maude/unzipped
theodora@DESKTOP-G4PFJGJ:~$ mv /mnt/d/tmp/maude/*.zip ~/maude/
theodora@DESKTOP-G4PFJGJ:~$ find ~/maude/ -name "*.zip" -exec unzip {} -d ~/maude/unzipped/ \;
theodora@DESKTOP-G4PFJGJ:~$ cd ~/maude/unzipped/
theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ ls .
foitext2020.txt foitext2021.txt foitext2022.txt foitext2023.txt foitext2024.txt
theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ wc -l ./*.txt
4020739 ./foitext2020.txt
4872212 ./foitext2021.txt
6640118 ./foitext2022.txt
5439271 ./foitext2023.txt
6083071 ./foitext2024.txt
27055411 total
Some more notes:
- The -p flag in the mkdir command creates parent directories as needed. For example, since the “~/maude/” directory does not exist yet, it will be created along with the “unzipped” subdirectory.
- The mv command moves (or renames) files or directories.
- The period “.” means “current directory”. So ./*.txt means “all text files in the current directory”. You’ll often see the “./” prefix used before filenames, as it really helps with tabbed autocompletion in the terminal.
This time, when I ran the wc command, it completed in just a few seconds (nearly a 100X speedup), since the files are now located within the WSL filesystem.
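The mv command above removes the zip files from the Windows side. If you’d rather keep the originals where they are, cp does the same job but leaves the source in place. A minimal sketch with scratch directories and a placeholder file (hypothetical names, not a real zip archive):

```shell
# Two scratch directories standing in for the Windows and WSL folders.
src_dir=$(mktemp -d)
dst_dir=$(mktemp -d)
printf 'placeholder\n' > "$src_dir/example.zip"   # not a real zip archive

# cp leaves the original in place; -v prints each file as it is copied.
cp -v "$src_dir"/*.zip "$dst_dir"/

# cp -r copies a directory and everything under it.
cp -r "$src_dir" "$dst_dir/backup"

ls "$src_dir" "$dst_dir"
```

The trade-off is disk space: after a copy, the data exists in both places until you delete one of them.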
Another really handy way to move files to/from WSL is to run the explorer.exe command from within WSL. This will open a Windows File Explorer window at the specified path:
theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ explorer.exe .
Now you can easily drag-and-drop files between Windows and WSL!

Important Note About WSL Disk Usage
WSL stores all of its data in a file with a “virtual hard disk” (.vhdx) extension. The location is distribution-dependent. While WSL will automatically grow this file as needed, it will not automatically shrink it when files are deleted within WSL. As I do the majority of my work within WSL, I occasionally find that my WSL virtual hard disk has grown to tens of gigabytes in size, even though I’ve deleted large files within WSL. For this reason it’s generally best not to store large files in WSL long term, though I sometimes forget to clean up.
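Before cleaning up, it helps to know what is actually taking up space inside WSL. One way to check, sketched here on a scratch directory with made-up contents, is to summarize each subdirectory with du and sort the result:

```shell
# Scratch directory with one large and one small subdirectory (made-up data).
demo_dir=$(mktemp -d)
mkdir "$demo_dir/big" "$demo_dir/small"
seq 1 200000 > "$demo_dir/big/data.txt"
printf 'x\n' > "$demo_dir/small/data.txt"

# du -s prints one summary line per argument; -k reports sizes in KiB.
# sort -n orders numerically, so the largest entry ends up last.
du -sk "$demo_dir"/* | sort -n

# du separates size and path with a tab, so cut -f 2 extracts the path.
biggest=$(du -sk "$demo_dir"/* | sort -n | tail -n 1 | cut -f 2)
echo "Largest directory: $biggest"
```

Pointing the same du command at the directories in your home folder is a quick way to spot forgotten datasets.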
There are several ways to reclaim disk space if necessary. One common way is to use the diskpart command. Another is to export the WSL distribution to a tarball (assuming you have enough free disk space to store it), unregister the distribution (which deletes the .vhdx file), then re-import it from the tarball. (This is also a handy way to migrate a WSL distribution to a new PC.) Once re-imported, the tarball can be deleted.
For example, from a Windows PowerShell prompt, I would run:
wsl --shutdown
wsl --export <DistroName> <TempFileName>.tar
wsl --unregister <DistroName>
wsl --import <DistroName> <InstallLocation> <TempFileName>.tar --version 2
del <TempFileName>.tar