Playing around with Vectorization
While I'm on paternity leave, I'm enjoying a bit of time off to write a blog article. Here we go!
Vectorization of code is one of the key tenets of performance computing. Unfortunately, these aspects are quite hidden in high-level programming language (Python/R/SQL) and Data Scientists are unaware of the inner workings. As a matter of fact, even Software Engineer at Tech companies rarely go that low.
This post aims at rediscovering them. Let's dive in!
Problem Statement:
Let's take a simple problem: A social media company wants to send invites to folks that have a 2nd degree connection (i.e. at least 1 friend in common) but are not friends yet. The problem gives you a N x N matrix containing bits (1=friend). How many emails would you send?
Context: There are better ways to solve this problem (for instance, by storing the index of the friends as opposed to all bits). Here, we are taking the data format as a constraint.
Implementations:
0- Baseline approach
We start with the most intuitive approach: store the matrix into a vector<vector<bool>>. The algorithm goes this way:
- For each user i
- For each user j
- if i and j are not already friends
- Check if there is a user c such as c is both friends with i and j
This approach has O(n^3) time complexity and runs in 39ms on my computer (Macbook Pro Intel i-7).
The code looks like this:
// short-hand
#define rep(i, a, b) for (int i = (a); i < (int) (b); ++i)
auto input = load_as_vector(file_name); int q = input.first; // number of users vector<vector<bool>> &input_data = input.second; bool not_friend; ull cnt = 0; // email to send rep(i, 0, q) { // from these users rep(j, i + 1, q) { // to these users not_friend = !input_data[i][j]; if (not_friend) { bool flag = false; rep(c, 0, q) { flag = input_data[i][c] & input_data[j][c]; if (flag) { cnt++; break; } } } }
return cnt << 1;
1- Bitset approach
A smarter approach would be to use the specialized data structure (bitset) that can do the heavy lifting for us. Instead of having a third loop, one can just do a bitwise-AND between the two relationship bitsets.
This approach has the same complexity but is much faster (>2X), clocking at 16ms. The code now looks like that (note that the 3rd loop has been replaced but the bitwise-AND):
auto input = load_as_bitset(file_name); int q = input.first; vector<bitset<2000>> &input_data = input.second; bool not_friend = false; ull cnt = 0; rep(i, 0, q) { // from these users rep(j, i + 1, q) { // to these users not_friend = !input_data[i][j]; if (not_friend) { bitset<2000> common_friends = input_data[i] & input_data[j];
if (common_friends.any()) cnt += 1; } }
return cnt << 1;
2- Bitpacking + SIMD approach
The idea behind this approach is to pack 64 relationships into a single integer (an Unsigned Long Long or ULL). Using Unsigned makes sense as it uses 100% of the capacity (whereas a Signed int would be at 63/64). To get a specific relationship (say with user n), we just have to do a bitshift: (i >> n) & 1.
Our solution now clocks at 11ms so over 3X faster than the baseline approach (despite doing more computation) and 30% faster than the Bitset approach (which is a bit surprising to me).
Why is it so fast? Well, it's structured in a way to allow the compiler a lot of computation in parallel:
- One can check 64 relationships at the time doing a bitwise AND between two ULL
- Furthermore, the 3rd loop is nicely set up as a SIMD reduction (running vectorized AND and reducing with vectorized OR).
Digging deeper into #2, let's look at the assembly code of the 3rd loop in Godbolt. You can see the 2 vectorized instructions show up (VPANDQ for vectorized AND and VPORQ for vectorized OR). Nice!
Now, depending on your compiler, this could compare up to 512 relationships (that is, 64 * 8ULL) at a time! What a speed-up!
The code looks like this (with some help from SO):
auto input = load_as_packed_ull(file_name); int q = input.first; auto &input_data = input.second; bool not_friend; ull cnt = 0; int size_arr = (int) input_data[0].size(); rep(i, 0, q) { // from these users rep(j, i + 1, q) { // to these users int step = j >> 6; // same as j / 64 but somewhat faster (?) int remainder = j % 64; not_friend = (input_data[i].at(step) >> remainder) % 2 == 0; if (not_friend) { ull counter = 0ULL; vector<ull> &v1 = input_data[i]; vector<ull> &v2 = input_data[j]; for (int c = 0; c < size_arr; ++c) { counter |= v1[c] & v2[c]; } if (counter != 0) cnt++; } } }
return cnt << 1;
Interestingly, if you change the variable counter to a boolean, the elapsed time doubles (21ms)! That's because it breaks the SIMD parallelization (my hypothesis is that the SIMD registers are expecting 512 bits and turning 64 bits (ull) into 1 bit (bool) may make the data unsuitable for these registers).
Conclusion:
Using SIMD gives some serious speed advantage but in return, it restricts your code to a few select operations (for most part, arithmetics and bitwise operations) as well as make it very difficult to use control flow (for instance, breaking from the loop if a condition is satisfied). That said, you can get the speed up by writing "normal" (yet SIMD friendly) code without the hassle of hand-tuned instructions that some folks had to go through (intrinsics), possibly because compilers are much better at finding these opportunities for you, leaving only a few instances where you have to write things manually (e.g. this set intersection paper due to complex logic).
Now, should you try to rewrite your high-level code to use SIMD? Probably not as most of our day-to-day libraries (e.g. numpy) have optimized code which will likely be a lot faster than your newly written C++ code. That said, when you have to write a loop in pure Python / R, SIMD optimizations are rare so your code performance is penalized twice: No vectorization + interpreted vs. compiled with impact in the range on speed of 100X common.
If you'd like to read more, Daniel Lemire has great insights in these low level considerations. I highly recommend his blog.
Comments
Post a Comment