Playing around with Vectorization

While I'm on paternity leave, I'm enjoying a bit of time off to write a blog article. Here we go!

Vectorization of code is one of the key tenets of performance computing. Unfortunately, these mechanics are quite hidden in high-level programming languages (Python/R/SQL), and many Data Scientists are unaware of the inner workings. As a matter of fact, even Software Engineers at Tech companies rarely go that low-level.

This post aims at rediscovering them. Let's dive in!


Problem Statement:

Let's take a simple problem: a social media company wants to send invites to folks who have a 2nd-degree connection (i.e. at least 1 friend in common) but are not friends yet. The problem gives you an N x N matrix of bits (1 = friend). How many emails would you send? For instance, if A-B and B-C are friends but A and C are not, both A and C receive an email: 2 emails for that pair.

Context: there are better ways to solve this problem (for instance, by storing the indices of each user's friends instead of all the bits). Here, we take the data format as a constraint.
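For illustration, here is a rough sketch of that alternative representation (the names are hypothetical, not part of the problem): each user stores the sorted indices of their friends, and a common friend can be found with a classic sorted-list intersection.

    #include <vector>
    using namespace std;

    // Hypothetical alternative: adjacency lists instead of an N x N bit
    // matrix. Each list holds a user's friend indices, sorted ascending.
    bool have_common_friend(const vector<int> &a, const vector<int> &b) {
        // Two-pointer intersection test, O(|a| + |b|).
        size_t x = 0, y = 0;
        while (x < a.size() && y < b.size()) {
            if (a[x] == b[y]) return true; // common friend found
            if (a[x] < b[y]) ++x;
            else ++y;
        }
        return false;
    }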

Implementations:


0- Baseline approach

We start with the most intuitive approach: store the matrix as a vector<vector<bool>>. The algorithm goes this way:
  • For each user i
    • For each user j
      • If i and j are not already friends
        • Check if there is a user c such that c is friends with both i and j
This approach has O(n^3) time complexity and runs in 39ms on my computer (MacBook Pro, Intel i7).

The code looks like this:

    // short-hands
    #define rep(i, a, b) for (int i = (a); i < (int) (b); ++i)
    typedef unsigned long long ull;

    auto input = load_as_vector(file_name);
    int q = input.first; // number of users
    vector<vector<bool>> &input_data = input.second;

    bool not_friend;
    ull cnt = 0; // pairs to invite (each pair yields 2 emails, see below)
    rep(i, 0, q) { // from these users
        rep(j, i + 1, q) { // to these users
            not_friend = !input_data[i][j];
            if (not_friend) {
                bool flag = false;
                rep(c, 0, q) { // look for a common friend c
                    flag = input_data[i][c] & input_data[j][c];
                    if (flag) {
                        cnt++;
                        break;
                    }
                }
            }
        }
    }

    return cnt << 1; // one email in each direction


1- Bitset approach

A smarter approach is to use a specialized data structure (std::bitset) that can do the heavy lifting for us. Instead of a third loop, one can simply do a bitwise AND between the two users' relationship bitsets.

This approach has the same complexity but is much faster (>2X), clocking in at 16ms. The code now looks like this (note that the 3rd loop has been replaced by the bitwise AND; also note that std::bitset needs its size at compile time, hence the hard-coded 2000):

    auto input = load_as_bitset(file_name);
    int q = input.first;
    vector<bitset<2000>> &input_data = input.second;

    bool not_friend = false;
    ull cnt = 0;

    rep(i, 0, q) { // from these users
        rep(j, i + 1, q) { // to these users
            not_friend = !input_data[i][j];
            if (not_friend) {
                bitset<2000> common_friends = input_data[i] & input_data[j];
                if (common_friends.any()) cnt++;
            }
        }
    }

    return cnt << 1;



2- Bitpacking + SIMD approach

The idea behind this approach is to pack 64 relationships into a single integer (an unsigned long long, or ULL). Using an unsigned type makes sense as it uses 100% of the capacity (whereas a signed integer would leave us at 63 bits out of 64). To get a specific relationship (say, with user n), we just do a bit shift: (packed >> n) & 1.
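As a minimal sketch of the idea (the actual packing happens inside load_as_packed_ull, whose internals aren't shown in this post, so take these helpers as illustrative):

    #include <vector>
    using namespace std;
    typedef unsigned long long ull;

    // Pack a row of N friendship bits into ceil(N / 64) ULLs.
    vector<ull> pack_row(const vector<bool> &row) {
        vector<ull> packed((row.size() + 63) / 64, 0ULL);
        for (size_t n = 0; n < row.size(); ++n)
            if (row[n]) packed[n >> 6] |= 1ULL << (n % 64);
        return packed;
    }

    // Probe the relationship with user n: shift the right ULL, mask the bit.
    bool is_friend(const vector<ull> &packed, size_t n) {
        return (packed[n >> 6] >> (n % 64)) & 1ULL;
    }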

Our solution now clocks in at 11ms, over 3X faster than the baseline approach (despite doing more computation) and ~30% faster than the bitset approach (which is a bit surprising to me).

Why is it so fast? Well, the code is structured in a way that allows the compiler to run a lot of the computation in parallel:
  • One can check 64 relationships at a time by doing a bitwise AND between two ULLs
  • Furthermore, the 3rd loop is nicely set up as a SIMD reduction (a vectorized AND, reduced with a vectorized OR).
Digging deeper into the second point, let's look at the assembly code of the 3rd loop in Godbolt. You can see the two vectorized instructions show up (VPANDQ for the vectorized AND and VPORQ for the vectorized OR). Nice!



Now, depending on your compiler and target CPU, this can compare up to 512 relationships (64 bits x 8 ULLs, the width of a 512-bit SIMD register) at a time! What a speed-up!
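If you want to reproduce this, here is a minimal standalone version of the 3rd loop to paste into Godbolt (the compiler flags are my assumption, e.g. -O3 with an AVX-512 target such as -march=skylake-avx512; the post doesn't list the exact setup):

    typedef unsigned long long ull;

    // Isolated 3rd loop: with -O3 and AVX-512 enabled, the compiler can
    // turn this into a VPANDQ / VPORQ loop over 512-bit registers.
    ull any_common(const ull *v1, const ull *v2, int size_arr) {
        ull counter = 0ULL;
        for (int c = 0; c < size_arr; ++c)
            counter |= v1[c] & v2[c];
        return counter;
    }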

The code looks like this (with some help from SO):

    auto input = load_as_packed_ull(file_name);
    int q = input.first;
    auto &input_data = input.second;

    bool not_friend;
    ull cnt = 0;
    int size_arr = (int) input_data[0].size(); // number of ULLs per user

    rep(i, 0, q) { // from these users
        rep(j, i + 1, q) { // to these users
            int step = j >> 6;      // same as j / 64: which ULL holds bit j
            int remainder = j % 64; // position of bit j within that ULL

            not_friend = (input_data[i].at(step) >> remainder) % 2 == 0;

            if (not_friend) {
                ull counter = 0ULL;

                vector<ull> &v1 = input_data[i];
                vector<ull> &v2 = input_data[j];

                // SIMD-friendly reduction: vectorized AND, OR-reduced
                for (int c = 0; c < size_arr; ++c) {
                    counter |= v1[c] & v2[c];
                }

                if (counter != 0)
                    cnt++;
            }
        }
    }

    return cnt << 1;

Interestingly, if you change the counter variable to a boolean, the elapsed time doubles (21ms)! That's because it breaks the SIMD parallelization (my hypothesis is that the SIMD registers expect full 512-bit lanes, and narrowing each 64-bit ULL down to a 1-bit bool makes the data unsuitable for them).
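For reference, the slow variant would look something like this, dropped into the same loop as above (a sketch; the 21ms figure is the measurement quoted above). Each iteration collapses a 64-bit AND result into a single bit, which leaves nothing for the wide registers to chew on.

    // Slow variant: reducing into a bool defeats the vectorization.
    bool counter = false;
    for (int c = 0; c < size_arr; ++c) {
        counter |= (v1[c] & v2[c]) != 0; // 64-bit result narrowed to 1 bit
    }
    if (counter)
        cnt++;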


Results:

    Approach                   Time    vs. baseline
    0- Baseline                39ms    1X
    1- Bitset                  16ms    ~2.4X
    2- Bitpacking + SIMD       11ms    ~3.5X
    (2 with a bool counter)    21ms    ~1.9X

Code in git: link

Conclusion:

Using SIMD gives a serious speed advantage, but in return it restricts your code to a few select operations (for the most part, arithmetic and bitwise operations) and makes it very difficult to use control flow (for instance, breaking out of the loop once a condition is satisfied). That said, you can get the speed-up by writing "normal" (yet SIMD-friendly) code, without the hassle of the hand-tuned instructions (intrinsics) that some folks had to go through, possibly because compilers are now much better at finding these opportunities for you. That leaves only a few instances where you have to write things manually (e.g. this set intersection paper, due to complex logic).

Now, should you try to rewrite your high-level code to use SIMD? Probably not: most of our day-to-day libraries (e.g. numpy) ship optimized code that will likely be a lot faster than your newly written C++. That said, when you have to write a loop in pure Python / R, SIMD optimizations are rare, so your code is penalized twice (no vectorization, and interpreted rather than compiled execution), with slowdowns in the 100X range being common.

If you'd like to read more, Daniel Lemire has great insights into these low-level considerations. I highly recommend his blog.



