Why Math Functions in C++ Are So Slow

Written by ryanburn | Published 2020/12/27
Tech Story Tags: c++ | performance | assembly | programming | benchmarks | backend | mathematics | software-development

TLDR Compilers produce code that is significantly slower than what a machine is capable of. Compilers may not be portable even between different x86-64 CPUs. Compiler will optimize the code heavily but will still produce C++ standard compliant code. Compiling the code with C++ -c -O3 -march=native, we get a benchmark benchmark. We’ll use google benchmark to measure how long it takes to compute the square roots of 1,000,000 numbers. Compile with GCC -c:native:compute_sqrt1/1000000 4984457 (4.17, 0.07, 115.05)via the TL;DR App

Performance has always been a high priority for C++, yet there are many examples both in the language and the standard library where compilers produce code that is significantly slower than what a machine is capable of. In this blog post, I’m going to explore one such example from the standard math library.
Suppose we’re tasked with computing the square roots of an array of floating point numbers. We might write a function like this to perform the operation:
// sqrt1.cpp
#include <cmath>

void compute_sqrt1(const double* x, int n, double* y) noexcept {
  for (int  i=0; i<n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
If we’re using gcc, we can compile the code with
g++ -c -O3 -march=native sqrt1.cpp
With 
-O3
, gcc will optimize the code heavily but will still produce code that is standard compliant. The 
-march=native
 option tells gcc to produce code targeting the native architecture’s instruction set. The resulting binaries may not be portable even between different x86-64 CPUs.
Now, let’s benchmark the function. We’ll use google benchmark to measure how long it takes to compute the square roots of 1,000,000 numbers:
// benchmark.cpp
#include <random>
#include <memory>
#include <benchmark/benchmark.h>

void compute_sqrt1(const double* x, int n, double* y) noexcept;

static void generate_random_numbers(double* x, int n) {
  std::mt19937 rng{0};
  std::uniform_real_distribution<double> dist{0, 100};
  for (int i=0; i<n; ++i) {
    x[i] = dist(rng);
  }
}

static void BM_Sqrt1(benchmark::State& state) {
  const int n = state.range(0);
  std::unique_ptr<double[]> xptr{new double[n]};
  generate_random_numbers(xptr.get(), n);
  for (auto _ : state) {
    std::unique_ptr<double[]> yptr{new double[n]};
    compute_sqrt1(xptr.get(), n, yptr.get());
    benchmark::DoNotOptimize(yptr);
  }
}

BENCHMARK(BM_Sqrt1)->Arg(1000000);

BENCHMARK_MAIN();
Compiling our benchmark and running we get
g++ -O3 -march=native -o benchmark benchmark.cpp sqrt1.o
./benchmark
Running ./benchmark
Run on (6 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x6)
Load Average: 0.17, 0.07, 0.05
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_Sqrt1/1000000    4984457 ns      4946631 ns          115

Can we do better? Let try this version:
// sqrt2.cpp
#include <cmath>

void compute_sqrt2(const double* x, int n, double* y) noexcept {
  for (int  i=0; i<n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
and compile with
g++ -c -O3 -march=native -fno-math-errno sqrt2.cpp
The only difference between 
compute_sqrt1
 and 
compute_sqrt2
 is that we added the extra option 
-fno-math-errno
 when compiling. I’ll explain later what 
-fno-math-errno
 does; but for now, I’ll only point out that the produced code is no longer standard compliant.
Let’s benchmark 
compute_sqrt2
.
// benchmark.cpp
...
static void BM_Sqrt2(benchmark::State& state) {
  const int n = state.range(0);
  std::unique_ptr<double[]> xptr{new double[n]};
  generate_random_numbers(xptr.get(), n);
  for (auto _ : state) {
    std::unique_ptr<double[]> yptr{new double[n]};
    compute_sqrt2(xptr.get(), n, yptr.get());
    benchmark::DoNotOptimize(yptr);
  }
}

BENCHMARK(BM_Sqrt2)->Arg(1000000);
...
Running
g++ -O3 -march=native -o benchmark benchmark.cpp sqrt2.o
./benchmark
we get
Running ./benchmark
Run on (6 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x6)
Load Average: 0.17, 0.07, 0.05
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_Sqrt2/1000000    1195070 ns      1192078 ns          553

Yikes! 
compute_sqrt2
 is more than 4 times faster than 
compute_sqrt1
.
What’s different? Let’s drill down into the assembly to find out. We can produce the assembly for the code by running
g++ -S -c -O3 -march=native sqrt1.cpp
g++ -S -c -O3 -march=native -fno-math-errno sqrt2.cpp

The result will depend on what architecture you’re using, but looking at sqrt1.s on my architecture, we see this section
.L3:
	vmovsd	(%rdi), %xmm0
	vucomisd	%xmm0, %xmm2
	vsqrtsd	%xmm0, %xmm1, %xmm1
	ja	.L12
	addq	$8, %rdi
	vmovsd	%xmm1, (%rdx)
	addq	$8, %rdx
	cmpq	%r12, %rdi
	jne	.L3

Let’s break down the first few instructions:
1:	vmovsd	(%rdi), %xmm0  
      # Load a value from memory into the register %xmm0
2:	vucomisd	%xmm0, %xmm2
      # Compare the value of %xmm0 with %xmm2 and set the register
      # EFLAGS with the result
3:	vsqrtsd	%xmm0, %xmm1, %xmm1  
      # Compute the square root of %xmm0 and store in %xmm1
4:	ja	.L12 
      # Inspects EFLAGS and jumps if %xmm2 is above %xmm0

What are instructions 3 and 4 for? Recall that for real numbers, sqrt is undefined on negative values. When std::sqrt is passed a negative number, the C++ standard requires that it return the special floating point value NaN and that it set the global variable errno to EDOM. But that error handling ends up being really expensive.
If we look at sqrt2.s, we see these instructions for the main loop:
.L6:
	addl	$1, %r8d
	vsqrtpd	(%r10,%rax), %ymm0
	vextractf128	$0x1, %ymm0, 16(%rcx,%rax)
	vmovups	%xmm0, (%rcx,%rax)
	addq	$32, %rax
	cmpl	%r8d, %r11d
	ja	.L6

Without the burden of having to do error handling, gcc can produce much faster code. vsqrtpd is what’s known as a Single Instruction Multiple Data (SIMD) instruction. It computes the the square root of four double precision floating point numbers at a time. For computationally expensive functions like sqrt, vectorization helps a lot.
It’s unfortunate that the standard requires such error handling. It’s so much slower to do the error checking that many compilers like Intel’s icc and Apple’s default clang-based compiler opt out of the error handling by default. Even if we want std::sqrt do error handling, we can’t portably rely on major compilers to do so.
The complete benchmark can be found at rnburn/cmath-bechmark.

Written by ryanburn | Mathematical Engineer | Building better models: buildingblock.ai
Published by HackerNoon on 2020/12/27