Issue

I'm an R developer that uses C for algorithmic purposes and have a question about why a C loop that seems like it would be slow is actually faster than the alternative approach.

In R, our boolean type can actually have 3 values, true, false, and na, and we represent this using an int at the C level.

I'm looking into a vectorized && operation (yes we have this in R already, but bear with me) that also handles the na case. The scalar results would look like this:

 F && F == F
 F && T == F
 F && N == F

 T && F == F
 T && T == T
 T && N == N

 N && F == F
 N && T == N
 N && N == N

Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.

Now to the implementation, assume we have two vectors v_out and v_x, and we'd like to perform the vectorized && on them. We are allowed to overwrite v_out with the result. One option is:

// Option 1
for (int i = 0; i < size; ++i) {
  int elt_out = v_out[i];
  int elt_x = v_x[i];

  if (elt_out == 0) {
    // Done
  } else if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}

and another option is:

// Option 2
for (int i = 0; i < size; ++i) {
  int elt_out = v_out[i];

  if (elt_out == 0) {
    continue;
  }

  int elt_x = v_x[i];

  if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}

I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required. But in fact it was twice as slow when compiled with -O2!

In the following script, I get the following timing results. Note that I am on a Mac and compile with clang.

Seems reasonable with O0, they are about the same.
2x faster with O2 with Option 1!

Option 1, `clang -O0`
0.110560

Option 2, `clang -O0`
0.107710

Option 1, `clang -O2`
0.032223

Option 2, `clang -O2`
0.070557

Can anyone explain what is going on here? My best guess is that it has something to do with the fact that in Option 1 v_x[i] is always being accessed linearly, which is extremely fast. But in Option 2, v_x[i] is essentially being accessed randomly (sort of) because it might access v_x[10] but then not need another element from v_x until v_x[120], and because that access isn't linear, it is probably much slower.

Reproducible script:

#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>

int main() {
  srand(123);

  int size = 1e7;
  int na = INT_MIN;

  int* v_out = (int*) malloc(size * sizeof(int));
  int* v_x = (int*) malloc(size * sizeof(int));

  // Generate random numbers between 1-3
  // 1 -> false
  // 2 -> true
  // 3 -> na
  for (int i = 0; i < size; ++i) {
    int elt_out = rand() % 3 + 1;

    if (elt_out == 1) {
      v_out[i] = 0;
    } else if (elt_out == 2) {
      v_out[i] = 1;
    } else {
      v_out[i] = na;
    }

    int elt_x = rand() % 3 + 1;

    if (elt_x == 1) {
      v_x[i] = 0;
    } else if (elt_x == 2) {
      v_x[i] = 1;
    } else {
      v_x[i] = na;
    }
  }

  clock_t start = clock();

  // Option 1
  for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
      // Done
    } else if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }

  // // Option 2
  // for (int i = 0; i < size; ++i) {
  //   int elt_out = v_out[i];
  //
  //   if (elt_out == 0) {
  //     continue;
  //   }
  //
  //   int elt_x = v_x[i];
  //
  //   if (elt_x == 0) {
  //     v_out[i] = 0;
  //   } else if (elt_out == na) {
  //     // Done
  //   } else if (elt_x == na) {
  //     v_out[i] = na;
  //   }
  // }

  clock_t end = clock();
  double time = (double) (end - start) / CLOCKS_PER_SEC;

  free(v_out);
  free(v_x);

  printf("%f\n", time);
  return 0;
}

Updates: Based on a few questions in the comments, here are a few points of clarifications for future readers:

I am on a 2018 15-inch Macbook Pro with a 2.9 GHz 6-Core Intel i9-8950HK (6 core Coffee Lake.)
My particular clang version that I tested with is Apple clang version 13.1.6 (clang-1316.0.21.2.5) with Target: x86_64-apple-darwin21.6.0
I am restricted by R to use int as the data type (even though there are more efficient options) and the following coding: false = 0, true = 1, na = INT_MIN. The reproducible example that I provided respects this.
The original question was not actually a request to make the code run faster, I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations that those users have provided! That has greatly influenced the final version of the implementation I am working on.

Solution

Why is this seemingly slower C loop actually twice as fast as the other way?

At a high level, it is a quirk of the compiler and execution environment you are using. Unless array v_x is declared volatile, the compiler is free to interpret the two variations on your code exactly the same way.

I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required.

And if the compiler's optimizer judged that to be true, then it could make use of that judgement to conditionally avoid reading v_x[i] in conjunction with the first code.

But at a lower level, if the compiler generates code that indeed does conditionally avoid reading v_x[i] in option 2 but not in option 1, then you are probably observing the effects of branch misprediction in the option 2 case. It is entirely plausible that it is cheaper on average to read v_x[i] unconditionally than to suffer large numbers of branch misprediction penalties involving whether it should be read.

One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when the branch is difficult for the CPU to predict. Where the same computation can be performed via a branchless approach, that may yield a performance win in practice, at, usually, a cost in source code clarity. @KarlKnechtel's answer presents a possible branchless (but for testing the for loop condition, which is pretty predictable) variation on the computation you are trying to perform.

Answered By - John Bollinger

Answer Checked By - Pedro (PHPFixing Volunteer)

Monday, October 31, 2022

[FIXED] Why is this seemingly slower C loop actually twice as fast as the other way?

Issue

Solution

0 Comments:

Post a Comment

Total Pageviews

Featured Post

Why Learn PHP Programming

Monday, October 31, 2022

Issue

Solution

0 Comments:

Post a Comment

Total Pageviews

Featured Post

Why Learn PHP Programming

Subscribe To