atp

Atp's external memory

leading zeros - builtins and FFSL

I took the time to start using Mike's HDR histogram port to C the other day. It was pretty painless to use, once I'd overcome a couple of initial problems. One of which is that the C compiler I have on CentOS 6.4 was too old. So I came up with a workaround for older compilers/machines. 

HDRHistogram is Azul's high dynamic range histogram. The readme on github explains what it is. Suffice it to say that I'm in the process of dumping the various implementations of histogram code I've evolved over the years in favour of this for latency and performance management.

So, in one routine Mike uses __lzcnt64() - which is an intrinsic only available in recent versions of gcc and clang as part of the bucket indexing code. My ancient opteron running gcc 4.6 and CentOS 6.4 machines running even older gcc 4.4.6 don't have that. 

Here's the routine;

static int32_t get_bucket_index(struct hdr_histogram* h, int64_t value)
{
// smallest power of 2 containing value
    int32_t pow2ceiling = 64 - __lzcnt64(value | h->sub_bucket_mask);
    return pow2ceiling - (h->sub_bucket_half_count_magnitude + 1);
}

Which we can rewrite as;

static int32_t get_power_of_two(int64_t value)
{
   return 64 - __lzcnt64(value); // smallest power of 2 containing value
}

static int32_t get_bucket_index(struct hdr_histogram* h, int64_t value)
{
    int32_t pow2ceiling = get_power_of_two(value | h->sub_bucket_mask);
    return pow2ceiling - (h->sub_bucket_half_count_magnitude + 1);
}

 I got misled initially by the "smallest power of 2 containing value" comment. Its actually "find the most significant bit" which is exactly what lzcnt64 does. 

The obvious thing to do - for gcc - is this;

static int32_t get_power_of_two_gcc(int64_t x)
{
     return 64 - __builtin_clzl(x);
}

 However we want to compile it on clang too. As it happens, clang supports that builtin, but it got me thinking about how to do this in vanilla C without relying on fancy builtins that may or may not work. This means basic library calls only. 

So, the alternative implementation;

static int32_t get_power_of_two(int64_t x)
{
 x |= x >> 1;
 x |= x >> 2;
 x |= x >> 4;
 x |= x >> 8;
 x |= x >> 16;
 x |= x >> 32;
 return ffsl(x+1)-1;
}

 Which uses a variant on the rounding down bitshift algorithm from "hackers delight" to give us the number of the highest bit set. This compiles down to a handful of fast primitive assembly ops. Loads, ors and a bsr for the find first set bit. 

As always there's bound to be a cleverer/faster way of doing this. For example the ffsl(x+1)-1 could be replaced with a ffsl(x - (x>>1)). That may be faster. Or not... 

Either way, I now can use the fast shiny new toy on my ancient hardware/software. We could use feature test macros to switch between __lzcnt64() or __builtin_clzl(), however I think I'll just use the gcc builtin. It seems to work under clang too, and looking at the assembly, its a bit more compact than the vanilla version. 

So that was a mini adventure.

Written by atp

Wednesday 26 March 2014 at 2:09 pm

Posted in Default, Linux

Leave a Reply