Street Fighting Data Analysis with WSL - Part 1

November 27, 2025 9 min read Kevin Davis
#data science #wsl #Linux

Overview

Unix-like operating systems (Linux, Unix, and macOS) include a common set of powerful command line utilities that can make quick work of basic data analysis and data wrangling tasks. These utilities are especially useful when working with datasets that are several gigabytes in size: too large to be analyzed easily with tools like Excel, MATLAB, or Python’s pandas, yet not so large that they require a full-blown distributed computing framework like Apache Spark.

This is the first in a planned series of posts that show examples of my most common go-to commands for quickly inspecting, cleaning, and transforming datasets. In this post we’ll focus on basic file system operations such as moving, copying, searching for, unzipping, and counting lines in files.

Since most biomedical researchers seem to use Windows machines (like me), I’ll be demonstrating these commands using the Windows Subsystem for Linux (WSL). macOS and Linux users should be able to follow along just fine, as the commands are nearly identical.

⚠️

Note to Friendly IT System Administrators

Your friendly IT system administrator..

Yes yes, I know that there are more efficient ways to do many of the things I’m about to describe. This post is primarily intended for biomedical researchers who aren’t concerned about best practices, but rather just want to get things done quickly and with minimal fuss.

I also acknowledge you have forgotten more about Linux than I could ever hope to know.

Quick Reference

This post will cover the following commands:

Command | Description                                  | Example
cd      | Change directory                             | cd /path/to/directory/
cp      | Copy files or directories                    | cp /path/to/source /path/to/destination
find    | Search for files and perform actions on them | find /path/to/directory/ -name "*.txt" -exec cat {} \;
ls      | List directory contents                      | ls /path/to/directory/
mv      | Move (or rename) files or directories        | mv /path/to/source /path/to/destination
wc      | Word count (lines, words, characters)        | wc -l /path/to/file

Prerequisites

Installing WSL

If you’re using Windows 10 or later, you can install WSL, a fully contained Linux environment, by following Microsoft’s official installation guide. I recommend installing the latest LTS release of Ubuntu as your WSL distribution, which is version 24.04 as of this writing.

Downloading the MAUDE dataset

For demonstration purposes, I’ll be using the publicly available MAUDE dataset from the U.S. Food and Drug Administration (FDA). Using a web browser, I’ve manually downloaded the text narrative files for years 2020-2024 to a temporary folder on my Windows machine:

Windows folder containing MAUDE zip files

These files contain complaint narratives for medical device defects reported to the FDA.

Basic File Operations

ls: Listing Directory Contents

I saved the downloaded zip files to D:\tmp\maude. WSL mounts Windows drives under the /mnt/ directory, so I can access my files from within WSL at /mnt/d/tmp/maude/. To list the directory contents, we can use the ls command:

theodora@DESKTOP-G4PFJGJ:~$ ls /mnt/d/tmp/maude/
foitext2020.zip  foitext2021.zip  foitext2022.zip  foitext2023.zip  foitext2024.zip
theodora@DESKTOP-G4PFJGJ:~$

Note a couple of things:

  • The prompt shows my WSL username (theodora) and the machine name (DESKTOP-G4PFJGJ).
  • Upon running the ls /mnt/d/tmp/maude/ command, the foitext zip files are listed on the following line.
  • The tilde (~) indicates my current working directory, which is my WSL home directory.
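A plain ls only prints names. For sizes and timestamps, a long listing is often more useful. Here is a minimal sketch that uses a scratch directory and placeholder file names (both are assumptions for illustration, not the real MAUDE files):

```shell
# Create a throwaway directory with two placeholder files (hypothetical names).
demo_dir="$(mktemp -d)"
printf 'a\nb\nc\n' > "$demo_dir/sample2020.txt"
printf 'a\n' > "$demo_dir/sample2021.txt"

# -l: long format (permissions, owner, size, modification time)
# -h: human-readable sizes (K, M, G) instead of raw byte counts
ls -lh "$demo_dir"

# -1: one entry per line, which is handy for piping into other commands
file_count="$(ls -1 "$demo_dir" | wc -l)"
echo "Files found: $file_count"
```

Pointing the same ls -lh at /mnt/d/tmp/maude/ would show the size of each zip file next to its name.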

find: Searching for files, and doing things with them

Alternatively, we can use the find command to search for files in a directory:

theodora@DESKTOP-G4PFJGJ:~$ find /mnt/d/tmp/maude/
/mnt/d/tmp/maude/
/mnt/d/tmp/maude/foitext2020.zip
/mnt/d/tmp/maude/foitext2021.zip
/mnt/d/tmp/maude/foitext2022.zip
/mnt/d/tmp/maude/foitext2023.zip
/mnt/d/tmp/maude/foitext2024.zip

Unlike ls, the find command recursively lists all files and directories under the specified path. But that’s not all! The find command is extremely powerful because it allows you to perform actions on the files it finds. I use it almost daily. Exploring all of its capabilities would warrant a blog post of its own.

Let’s use it to unzip all of the MAUDE zip files into a new directory:

theodora@DESKTOP-G4PFJGJ:~$ mkdir /mnt/d/tmp/maude/unzipped
theodora@DESKTOP-G4PFJGJ:~$ find /mnt/d/tmp/maude/ -name "*.zip" -exec unzip {} -d /mnt/d/tmp/maude/unzipped/ \;

Note a few things:

  • We used the mkdir command to create a new subdirectory called unzipped.
  • The find command searches for all files with names matching the pattern *.zip under the specified path.
  • The -exec flag allows us to execute a command on each file found. Here, we use the unzip command to extract each zip file ({} is a placeholder for the current file found) into the unzipped directory.

Now if we look inside the unzipped directory, we can see all of the extracted text files:

theodora@DESKTOP-G4PFJGJ:~$ ls /mnt/d/tmp/maude/unzipped/
foitext2020.txt  foitext2021.txt  foitext2022.txt  foitext2023.txt  foitext2024.txt

Windows folder containing unzipped MAUDE files
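A side note on -exec: before letting find run a command on many files, it can be worth previewing what it will do. The sketch below prefixes the command with echo so that nothing is actually extracted, and adds two filters that I find handy. The scratch directory and file names are placeholders, and -maxdepth is an extension found in GNU and BSD find rather than part of the POSIX standard.

```shell
# Scratch directory with placeholder files (hypothetical names).
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/nested"
touch "$demo_dir/a.zip" "$demo_dir/nested/b.zip" "$demo_dir/notes.txt"

# -type f restricts matches to regular files (directories are skipped).
# Prefixing the command with echo prints what would run without running it.
find "$demo_dir" -type f -name "*.zip" -exec echo unzip {} -d "$demo_dir/unzipped/" \;

# -maxdepth 1 stops find from descending into subdirectories.
top_level_zips="$(find "$demo_dir" -maxdepth 1 -type f -name "*.zip" | wc -l)"
all_zips="$(find "$demo_dir" -type f -name "*.zip" | wc -l)"
echo "Top-level zips: $top_level_zips, all zips: $all_zips"
```

Once the printed commands look right, removing the echo turns the preview into the real thing.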

wc: Word Count (and line count)

The wc (word count) command is a simple yet useful utility for counting lines, words, and characters in text files. I always use it to count lines, via the “-l” flag.

theodora@DESKTOP-G4PFJGJ:~$ wc -l /mnt/d/tmp/maude/unzipped/*.txt
    4020739 /mnt/d/tmp/maude/unzipped/foitext2020.txt
    4872212 /mnt/d/tmp/maude/unzipped/foitext2021.txt
    6640118 /mnt/d/tmp/maude/unzipped/foitext2022.txt
    5439271 /mnt/d/tmp/maude/unzipped/foitext2023.txt
    6083071 /mnt/d/tmp/maude/unzipped/foitext2024.txt
   27055411 total

There are over 27 million lines across all five text files, and they total over 10 gigabytes in size! Loading these files into memory using a typical data analysis tool (e.g., pandas in Python, or data.frame in R) would be impractical on most personal computers.
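Line counts and file sizes can both be checked from the command line. The following is a small sketch on a placeholder file (the file and its contents are made up for illustration); the same commands apply to the foitext files.

```shell
# Scratch directory with one small placeholder file.
demo_dir="$(mktemp -d)"
printf 'line one\nline two\nline three\n' > "$demo_dir/sample.txt"

# Reading from standard input makes wc print only the number,
# without the file name, which is convenient inside scripts.
line_count="$(wc -l < "$demo_dir/sample.txt")"
echo "Lines: $line_count"

# -c counts bytes instead of lines.
byte_count="$(wc -c < "$demo_dir/sample.txt")"
echo "Bytes: $byte_count"

# du reports disk usage: -h prints human-readable sizes,
# -c appends a grand total across all listed files.
du -ch "$demo_dir"/*.txt
```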

If you are actually following along and trying these commands yourself (of course you’re not; nor would I), you might have noticed that the above command took several minutes to run. This is because the files are located on the Windows filesystem, not inside the WSL filesystem, which slows access down by more than an order of magnitude. Think of WSL as a separate PC that has to communicate with your Windows machine over a slow network connection.

mv and cp: Moving and Copying Files

To speed things up, let’s move the zip files into the WSL filesystem, into a maude directory in my home folder. Then we’ll unzip them there and run wc again:

theodora@DESKTOP-G4PFJGJ:~$ mkdir -p ~/maude/unzipped
theodora@DESKTOP-G4PFJGJ:~$ mv /mnt/d/tmp/maude/*.zip ~/maude/
theodora@DESKTOP-G4PFJGJ:~$ find ~/maude/ -name "*.zip" -exec unzip {} -d ~/maude/unzipped/ \;
theodora@DESKTOP-G4PFJGJ:~$ cd ~/maude/unzipped/
theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ ls .
foitext2020.txt  foitext2021.txt  foitext2022.txt  foitext2023.txt  foitext2024.txt
theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ wc -l ./*.txt
    4020739 ./foitext2020.txt
    4872212 ./foitext2021.txt
    6640118 ./foitext2022.txt
    5439271 ./foitext2023.txt
    6083071 ./foitext2024.txt
   27055411 total

Some more notes:

  • The -p flag in the mkdir command creates parent directories as needed. For example, since the “~/maude/” directory does not exist yet, it will be created along with the “unzipped” subdirectory.
  • The mv command moves (or renames) files or directories.
  • The period “.” means “current directory”. So ./*.txt means “all text files in the current directory”. You’ll often see the “./” prefix used before filenames, as it really helps with tabbed autocompletion in the terminal.
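The cp command works much like mv, except that the original stays where it is. Here is a minimal sketch of cp alongside mv, using a scratch directory and placeholder file names (assumptions for illustration only):

```shell
# Scratch directory with a placeholder source file.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/source" "$demo_dir/backup"
printf 'example\n' > "$demo_dir/source/data.txt"

# cp leaves the original in place; mv would remove it from the source.
cp "$demo_dir/source/data.txt" "$demo_dir/backup/"

# -r copies a directory and everything beneath it.
cp -r "$demo_dir/source" "$demo_dir/source_copy"

# mv with a new name in the same directory is simply a rename.
mv "$demo_dir/backup/data.txt" "$demo_dir/backup/data_renamed.txt"

# List every file that now exists under the scratch directory.
find "$demo_dir" -type f | sort
```

If you would rather keep the zip files on the Windows drive as a backup, swapping mv for cp in the commands above leaves the originals untouched.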

This time, when I ran the wc command, it completed in just a few seconds (nearly a 100X speedup), since the files are now located within the WSL filesystem.

Another really handy way to move files to/from WSL is to run the explorer.exe command from within WSL. This will open a Windows File Explorer window at the specified path:

theodora@DESKTOP-G4PFJGJ:~/maude/unzipped$ explorer.exe .

Now you can easily drag-and-drop files between Windows and WSL!

WSL folder in Windows Explorer

⚠️

Important Note About WSL Disk Usage

WSL stores all of its data in a file with a “virtual hard disk” (.vhdx) extension. The location is distribution-dependent. While WSL will automatically grow this file as needed, it will not automatically shrink it when files are deleted within WSL. As I do the majority of my work within WSL, I occasionally find that my WSL virtual hard disk has grown to tens of gigabytes in size, even though I’ve deleted large files within WSL. For this reason it’s generally best not to store large files in WSL long term, though I sometimes forget to clean up.
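To see which directories inside WSL are taking up the space, du combined with sort gives a quick ranking. The sketch below runs against a scratch directory filled with placeholder data (the directory names and the 1 MB file are made up for illustration); in practice I would point it at something like my home directory.

```shell
# Scratch directory with one large and one small placeholder entry.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/big" "$demo_dir/small"
# Write 1 MB of random bytes to simulate a large file.
head -c 1048576 /dev/urandom > "$demo_dir/big/blob.bin"
printf 'tiny\n' > "$demo_dir/small/note.txt"

# du -s summarizes each argument; -k reports sizes in kilobytes
# so that sort -rn can order them numerically, largest first.
du -sk "$demo_dir"/* | sort -rn

# du separates size and path with a tab, so cut -f 2 extracts the path.
largest="$(du -sk "$demo_dir"/* | sort -rn | head -n 1 | cut -f 2)"
echo "Largest: $largest"
```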

There are several ways to reclaim disk space if necessary. One common way is to compact the .vhdx file with the diskpart command. Another is to export the WSL distribution to a tarball (assuming you have enough free disk space to store it), unregister the distribution (which deletes the .vhdx file), then re-import it from the tarball. (This is also a handy way to migrate a WSL distribution to a new PC.) Once re-imported, the tarball can be deleted.

For example, from a Windows PowerShell prompt, I would run:

wsl --shutdown
wsl --export <DistroName> <TempFileName>.tar
wsl --unregister <DistroName>
wsl --import <DistroName> <InstallLocation> <TempFileName>.tar --version 2
del <TempFileName>.tar