author     Lucas De Marchi <lucas.demarchi@intel.com>  2013-08-22 01:10:13 -0300
committer  Lucas De Marchi <lucas.demarchi@intel.com>  2013-09-20 01:08:46 -0500
commit     3ba7f59e84857eb4dbe56a68fc7a3ffe8a650393 (patch)
tree       14f75fb0dcf0469fdf2aaeecd658ff97fb97cac5
parent     6506ddf5a37849049509324eeff72697f94584e3 (diff)
util: Add ALIGN_POWER2
Add a static inline function to align a value to its next power of 2.
This is commonly done by a SWAR like the one in:
http://aggregate.org/MAGIC/#Next Largest Power of 2
However, a microbenchmark shows that the implementation here is faster.
It doesn't really impact the possible users of this function, but it's
interesting nonetheless.
On an x86_64 i7 Ivy Bridge, using CLZ instead of the OR and SHL chain
shows a ~4% advantage, and that is with BSR, since Ivy Bridge doesn't
have LZCNT. Newer Haswell processors have the LZCNT instruction, which
can make this even better. ARM also has a CLZ instruction, so it should
be faster there, too.
Code used to test:
...
v = val[i];
t1 = get_cycles(0);
a = ALIGN_POWER2(v);
t1 = get_cycles(t1);
t2 = get_cycles(0);
v = nlpo2(v);
t2 = get_cycles(t2);
printf("%u\t%llu\t%llu\t%d\n", v, t1, t2, v == a);
...
In which val is an array of 20 random unsigned ints, nlpo2 is the SWAR
implementation and get_cycles uses RDTSC to measure the elapsed cycles.
Averages:
ALIGN_POWER2: 30 cycles
nlpo2: 31.4 cycles
-rw-r--r--  libkmod/libkmod-util.h | 5 +++++
1 file changed, 5 insertions, 0 deletions
diff --git a/libkmod/libkmod-util.h b/libkmod/libkmod-util.h
index f7f3e90..8a70aeb 100644
--- a/libkmod/libkmod-util.h
+++ b/libkmod/libkmod-util.h
@@ -51,3 +51,8 @@ do { \
 	} *__p = (typeof(__p)) (ptr); \
 	__p->__v = (val); \
 } while(0)
+
+static _always_inline_ unsigned int ALIGN_POWER2(unsigned int u)
+{
+	return 1 << ((sizeof(u) * 8) - __builtin_clz(u - 1));
+}