Signal Standardization using SIMD in .NET

Motivation

We recently implemented a web API that classifies electrocardiogram (ECG) signals using a convolutional neural network (CNN). Before feeding the ECG signals into the CNN, one of the preprocessing steps is to standardize them to have zero mean and unit variance: $\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sigma}$. This is a common machine learning technique to improve gradient stability and accelerate convergence during training. It can be done easily and efficiently using Python’s NumPy library:

x = (x - x.mean()) / x.std()

where x is a vector of ECG samples. NumPy takes advantage of highly optimized C and Fortran libraries to efficiently vectorize each part of the calculation. MATLAB works similarly.

Had we implemented the web API in Python, this preprocessing step would have been trivial. Instead, we chose to implement it in C# using the ASP.NET Core framework. This was partly for performance reasons, but mostly for stability: I am generally hesitant to use Python code in production due to its dynamically typed and interpreted nature. Most of my past Python runtime bugs would have been caught at compile time in a language like C#, and fixing bugs in a regulated medical device environment is a lot more expensive than it might be in other industries where developers can quickly push releases into production.

However, C# and similar languages are not ideal for scientific computing, and implementing an optimized standardization step is quite a bit more difficult than the one-line Python snippet shown above. It took me at least a couple of hours, with the help of GitHub Copilot¹, to code and unit test a comparable function that uses SIMD in a manner similar to what MATLAB and numpy do.

Since then, I’ve wondered whether this was a premature optimization. The web API needs to run model inference on a CPU, which would likely dwarf the time taken on the preprocessing steps. Should I have just implemented this the naive way and moved on to other things? That’s what this blog post attempts to answer. First I’ll compare a naive implementation of standardization to one that uses SIMD, and then I’ll show some benchmark code and its results before offering some concluding thoughts.

Very Brief Intro to SIMD

“Single instruction, multiple data” (SIMD) is a feature of modern CPUs which allows us to apply the same operation to several data elements with a single instruction. For instance, consider this simple function which sums all the values in an array. We won’t worry about things like arithmetic overflow; this is for illustration purposes only:

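float sum(float[] arr) {
    // The "f" suffix is required: 0.0 is a double literal and will not
    // implicitly convert to float in C#.
    float result = 0.0f;
    for (int i = 0; i < arr.Length; ++i) {
        result += arr[i];
    }
    return result;
}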

Given an array of values like [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], the value of result will equal 1.0 after the first iteration, 3.0 after the 2nd, 6.0 after the 3rd, and so on.

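Modern CPUs have vector registers that are wider than a single 32-bit float (typically 128, 256, or 512 bits), so one SIMD instruction can operate on several array elements at once. To keep the illustration simple, suppose the vector register holds just two floats, so that each loop iteration processes two array elements. To accomplish this, we’ll change result from a single float value to a vector that is initialized with the values [0.0, 0.0].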

After the first loop iteration it will have the values [1.0, 2.0] added to it, so it will contain [1.0, 2.0]. Then we advance two elements and add the values [3.0, 4.0] to it, and it will contain [4.0, 6.0]. On the third iteration it will have [5.0, 6.0] added to it, and so on.

When we’re done looping through the array, we can then simply sum the elements in result. We have therefore effectively cut our number of loop iterations in half. If the array size is not a multiple of the result vector size (as is the case here), then we just need to loop through the leftover elements (in this case 9.0) and add them to the final result.
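The steps just described can be sketched in C# with System.Numerics. This is a minimal illustration rather than production code: the class and method names (SimdSumExample, Sum) are made up for this post, and the vector width is whatever Vector<float>.Count reports on the machine, not necessarily the two elements used in the walkthrough.

```csharp
using System;
using System.Numerics;

public static class SimdSumExample
{
    // Sums a span of floats, adding Vector<float>.Count elements per iteration.
    public static float Sum(ReadOnlySpan<float> values)
    {
        int width = Vector<float>.Count;   // hardware-dependent number of lanes
        var acc = Vector<float>.Zero;      // running per-lane partial sums
        int i = 0;

        // Vectorized part: consume one full vector of elements per iteration.
        for (; i <= values.Length - width; i += width)
        {
            acc += new Vector<float>(values.Slice(i, width));
        }

        // Collapse the per-lane partial sums into a single scalar.
        float total = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail: leftover elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }
}
```

Calling SimdSumExample.Sum on the nine-element array from the walkthrough returns 45, the same as the scalar loop, regardless of how many lanes the hardware provides.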

Implementations of Signal Standardization

For comparison, I’ve implemented signal standardization three different ways: Naive, SIMD, and using the popular MathNet.Numerics.Statistics package.

Mostly to practice my LaTeX formulas, recall that the mean $\mu$ is simply the sum of elements divided by the number of elements, and standard deviation $\sigma$ is the square root of the variance.

$$\mu = \frac{1}{n}\sum_{j=1}^n x_j, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{j=1}^n (x_j - \mu)^2}$$
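As a quick sanity check of these definitions, here is a small worked example. The five-sample signal is made up for illustration and is not ECG data.

$$\mathbf{x} = (1, 2, 3, 4, 5), \qquad \mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3$$

$$\sigma = \sqrt{\frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5}} = \sqrt{2} \approx 1.414$$

$$\hat{\mathbf{x}} = \frac{\mathbf{x} - \mu}{\sigma} \approx (-1.414, -0.707, 0, 0.707, 1.414)$$

The standardized values sum to zero, and their squares average to one, as expected.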

Naive Implementation

Here is what a naive (non-optimal) implementation of signal standardization might look like in C#.

public static void Standardize_Naive(Span<double> signal)
{
    // Calculate mean
    double sum = 0;
    for (int i = 0; i < signal.Length; i++)
    {
        sum += signal[i];
    }
    double mean = sum / signal.Length;

    // Calculate standard deviation
    double variance = 0;
    for (int i = 0; i < signal.Length; i++)
    {
        variance += (signal[i] - mean) * (signal[i] - mean);
    }
    variance /= signal.Length;
    double std = Math.Sqrt(variance);

    // Standardize each sample
    for (int i = 0; i < signal.Length; i++)
    {
        signal[i] = (signal[i] - mean) / std;
    }
}

To reduce unnecessary memory copies, this function uses a Span<T> and standardizes the signal in place.
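One simple way to unit test a function like this is to confirm that its output really does have zero mean and unit variance. The helper below is a hypothetical sketch for that purpose; the names StandardizationCheck, Describe, and IsStandardized, and the default tolerance, are my own assumptions and not part of any library.

```csharp
using System;

public static class StandardizationCheck
{
    // Returns the mean and population standard deviation (1/n) of the samples.
    public static (double Mean, double Std) Describe(ReadOnlySpan<double> samples)
    {
        double sum = 0;
        foreach (double s in samples) sum += s;
        double mean = sum / samples.Length;

        double squaredDiffs = 0;
        foreach (double s in samples) squaredDiffs += (s - mean) * (s - mean);
        return (mean, Math.Sqrt(squaredDiffs / samples.Length));
    }

    // True when the samples have approximately zero mean and unit variance.
    public static bool IsStandardized(ReadOnlySpan<double> samples, double tol = 1e-9)
    {
        var (mean, std) = Describe(samples);
        return Math.Abs(mean) < tol && Math.Abs(std - 1.0) < tol;
    }
}
```

In a test, one would call Standardize_Naive on a buffer and then assert that StandardizationCheck.IsStandardized(buffer) is true. The tolerance may need loosening for very long signals, where floating-point rounding accumulates.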

SIMD Implementation

The SIMD version has quite a bit more code. I’ve extensively commented it to explain each step. Feel free to scroll past it if you just want to see the benchmark results. It has a few "new Vector<double>(...)" calls which I would have preferred to avoid, but I couldn’t figure out a way around them. As we’ll see, though, they have minimal impact on performance. Because Vector<T> is a struct (a value type), these calls don’t allocate heap memory; they just load the elements of the slice into a vector.

// Requires "using System;" and "using System.Numerics;" at the top of the file.
public static void Standardize_SIMD(Span<double> signal)
{
    // --------------------------
    // Calculate mean using SIMD.
    // --------------------------
    // First, slide an n-element vector across the span, adding elements from the span
    // to the vector as we go. For example, given a span of 7 elements and a 2-element vector,
    // the summation process will look like this:
    //
    // Start:
    // Span = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    // Summing vector = [0.0, 0.0]
    //
    // After step 0: Summing vector = [1.0, 2.0]
    // After step 1: Summing vector = [4.0, 6.0]
    // After step 2: Summing vector = [9.0, 12.0]
    //
    // Notice the last element in the span, 7.0, didn't get added yet. We'll handle this later.
    Vector<double> sumVector = Vector<double>.Zero;
    int simdLength = Vector<double>.Count;
    int i = 0;
    // Use <= so that a final, exactly full vector is also handled by the SIMD loop.
    for (; i <= signal.Length - simdLength; i += simdLength)
    {
        var values = new Vector<double>(signal.Slice(i, simdLength));
        sumVector += values;
    }

    // Now compute the sum of all of the elements in the vector. In the above
    // example, this would give sum = 9.0 + 12.0 = 21.0.
    double sum = Vector.Dot(sumVector, Vector<double>.One);

    // Finally, we have to add the remaining elements to the sum, i.e. the ones
    // which were left over at the end of the span after sliding the vector across
    // it.
    for (; i < signal.Length; i++)
    {
        sum += signal[i];
    }

    // Now that we have summed all of the elements in the span, we can use it to compute the mean.
    double mean = sum / signal.Length;


    // ---------------------------------------
    // Calculate Standard Deviation using SIMD
    // ---------------------------------------
    // This process is very similar to the "mean" calculation process explained above.
    Vector<double> varianceSumVector = Vector<double>.Zero;
    var meanVector = new Vector<double>(mean);
    i = 0;
    for (; i <= signal.Length - simdLength; i += simdLength)
    {
        var values = new Vector<double>(signal.Slice(i, simdLength));
        var diff = values - meanVector;
        varianceSumVector += diff * diff;
    }
    double varianceSum = Vector.Dot(varianceSumVector, Vector<double>.One);
    // Handle remaining elements
    for (; i < signal.Length; i++)
    {
        double diff = signal[i] - mean;
        varianceSum += diff * diff;
    }
    // Compute the standard deviation from the variance
    double variance = varianceSum / signal.Length;
    double stdDev = Math.Sqrt(variance);


    // ----------------------
    // Standardize using SIMD
    // ----------------------
    // Again, similar to above.
    var stdDevVector = new Vector<double>(stdDev);
    i = 0;
    for (; i <= signal.Length - simdLength; i += simdLength)
    {
        var values = new Vector<double>(signal.Slice(i, simdLength));
        var standardized = (values - meanVector) / stdDevVector;
        standardized.CopyTo(signal.Slice(i, simdLength));
    }
    // Handle remaining elements
    for (; i < signal.Length; i++)
    {
        signal[i] = (signal[i] - mean) / stdDev;
    }
}
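The loops above step through the signal Vector<double>.Count elements at a time, and that count depends on the machine the code runs on. The small sketch below reports what the runtime provides; the class and method names (VectorInfo, Describe) are hypothetical, while Vector.IsHardwareAccelerated and Vector<double>.Count are standard System.Numerics members.

```csharp
using System;
using System.Numerics;

public static class VectorInfo
{
    // Reports whether Vector<T> operations are hardware accelerated and
    // how many doubles fit into one vector on the current machine.
    public static string Describe() =>
        $"Hardware accelerated: {Vector.IsHardwareAccelerated}, " +
        $"doubles per vector: {Vector<double>.Count}";
}
```

Printing VectorInfo.Describe() on the benchmark machine is a cheap way to interpret the measured speedup, since the count is typically 2 with 128-bit registers and 4 with 256-bit registers.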

MathNet.Numerics Implementation

Using the MathNet.Numerics library takes considerably less code, though at a glance I wouldn’t expect it to perform as well as the SIMD version because of the ToArray() call, which allocates a new array and copies the signal into it on every call. I’m not aware of any better way to make the library work with memory-efficient structures like Span<T>.

public static void Standardize_MathNet(Span<double> signal)
{
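    // Requires "using MathNet.Numerics.Statistics;" at the top of the file.
    // MathNet works with IEnumerable<double>, so use ToArray for Span
    var arr = signal.ToArray();
    double mean = arr.Mean();
    // PopulationStandardDeviation() divides by n, matching the formula above.
    // StandardDeviation() would divide by n - 1 (the sample estimate) instead.
    double std = arr.PopulationStandardDeviation();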

    for (int i = 0; i < signal.Length; i++)
    {
        signal[i] = (signal[i] - mean) / std;
    }
}

Benchmarking the Performance

With the kind assistance of the Copilot LLM, I created a small scratch program which uses the popular BenchmarkDotNet NuGet package to run several experiments for each implementation and compare the results. In each experiment, I standardize a set of 10,000 signals, each with 20,000 elements, which closely matches the expected size of a real ECG.

The benchmark harness class is shown below. I haven’t extensively commented it, but it should be pretty self-explanatory to most C# programmers. The only things worth noting are that I initialize the test data once, and then make a copy of its values at the beginning of each experiment because the functions modify the input in place. Although this shouldn’t have any bearing whatsoever on performance, it’s nice to start from a consistent baseline each time in case I need to dig in further.

using System;
using BenchmarkDotNet.Attributes;

public class StandardizationBenchmarks
{
    private double[][] _signals;
    private double[][] _signalsCopy;
    private const int _NUM_ROWS = 10000;
    private const int _SIGNAL_LEN = 20000;


    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _signals = new double[_NUM_ROWS][];
        for (int i = 0; i < _NUM_ROWS; i++)
        {
            _signals[i] = new double[_SIGNAL_LEN];
            for (int j = 0; j < _SIGNAL_LEN; j++)
                _signals[i][j] = rng.NextDouble();
        }
    }


    [IterationSetup]
    public void IterationSetup()
    {
        // Deep copy for each iteration
        _signalsCopy = new double[_NUM_ROWS][];
        for (int i = 0; i < _NUM_ROWS; i++)
        {
            _signalsCopy[i] = new double[_SIGNAL_LEN];
            Array.Copy(_signals[i], _signalsCopy[i], _SIGNAL_LEN);
        }
    }


    [Benchmark]
    public void Naive()
    {
        for (int i = 0; i < _NUM_ROWS; i++)
        {
            SignalStandardizer.Standardize_Naive(_signalsCopy[i]);
        }
    }


    [Benchmark]
    public void SIMD()
    {
        for (int i = 0; i < _NUM_ROWS; i++)
        {
            SignalStandardizer.Standardize_SIMD(_signalsCopy[i]);
        }
    }


    [Benchmark]
    public void MathNet()
    {
        for (int i = 0; i < _NUM_ROWS; i++)
        {
            SignalStandardizer.Standardize_MathNet(_signalsCopy[i]);
        }
    }
}

Results

As expected (and hoped), the SIMD implementation was the fastest, coming in at roughly 2X faster than the naive implementation. This is in line with what we’d expect when each vector holds two doubles, which effectively halves the number of loop iterations (the exact width, Vector<double>.Count, depends on the hardware). The MathNet implementation was the slowest. Although it might be pretty efficient at calculating Mean() and StandardDeviation(), it still has to loop through all the array elements to adjust each one, and it does a new heap allocation with each call. Here are the experiment times, each covering all 10,000 signals:

| Method  | Mean       | Error    | StdDev   |
|-------- |-----------:|---------:|---------:|
| Naive   |   393.7 ms |  3.09 ms |  2.89 ms |
| SIMD    |   210.4 ms |  3.77 ms |  3.34 ms |
| MathNet | 1,476.1 ms | 20.00 ms | 17.73 ms |

Concluding Thoughts

When doing model inference on a CPU in the production environment, the web API is able to classify at a rate of roughly 100ms per ECG. Although the SIMD implementation is twice as fast as the naive one, it only saves about 0.02ms per ECG.
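The per-ECG saving follows directly from the benchmark table, where each measurement covers 10,000 signals:

$$\frac{393.7\ \text{ms} - 210.4\ \text{ms}}{10{,}000\ \text{signals}} \approx 0.018\ \text{ms per ECG}$$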

In other words, the performance improvement is a tiny drop in the bucket. Since the web API is already one of the fastest components in the processing chain by at least an order of magnitude, it would be fair to say that this whole optimization was premature and unnecessary. I should have knocked out the naive implementation and moved on to more important things. Had I bothered to benchmark the naive version first thing, I might have done so.

However, there are a few other factors worth considering. The first is that I finally got hands-on experience with a non-trivial SIMD implementation. Although it took me a couple of extra hours of work to implement and test this optimized version, the next time it won’t take nearly as long, and I’ll be able to quickly teach it to other people. Compounding knowledge has proven to be very useful over the long run.

Second, software has a tendency of being adapted for purposes other than its original intended use. It’s not uncommon to see someone write a program which is designed to process a few hundred data elements, and therefore give little thought to performance, only to later discover that another customer wants to use the same program to process a few million. When it runs using a GPU (as it almost certainly will someday), the web API will be around 100X faster, or roughly 1ms per ECG. Based on the benchmark above, the naive standardization step takes about 0.04ms per ECG, so its overhead would go from around 0.04% of the inference time to around 4%.

Finally, I don’t have the bandwidth (or interest) to regularly maintain every piece of production software I’ve written. When I pay attention to some of the low-hanging fruit optimizations up front, the software will often last many years before it has to be revisited.

Footnotes

¹ While Copilot was quite helpful in coming up with working examples of SIMD vectorization in .NET, its proposed standardization code included several unnecessary and expensive memory copies, which it was not able to fix when prompted. This was a few months ago. The result today might be rather different.