diff options
Diffstat (limited to 'src/checksum/power8/README.md')
-rw-r--r-- | src/checksum/power8/README.md | 208 |
1 files changed, 208 insertions, 0 deletions
diff --git a/src/checksum/power8/README.md b/src/checksum/power8/README.md new file mode 100644 index 00000000000..3e2976650cd --- /dev/null +++ b/src/checksum/power8/README.md @@ -0,0 +1,208 @@ +crc32-vpmsum +============ + +A set of examples for accelerating CRC32 calculations using the vector +polynomial multiply sum (vpmsum) instructions introduced in POWER8. These +instructions implement byte, halfword, word and doubleword carryless +multiply/add. + +Performance +----------- + +An implementation of slice-by-8, one of the fastest lookup table methods +is included so we can compare performance against it. Testing 5000000 +iterations of a CRC of 32 kB of data (to keep it L1 cache contained): + +``` +# time slice_by_8_bench 32768 5000000 +122.220 seconds + +# time crc32_bench 32768 5000000 +2.937 seconds +``` + +The vpmsum accelerated CRC is just over 41x faster. + +This test was run on a 4.1 GHz POWER8, so the algorithm sustains about +52 GiB/sec or 13.6 bytes/cycle. The theoretical limit is 16 bytes/cycle +since we can execute a maximum of one vpmsum instruction per cycle. + +In another test, a version was added to the kernel and btrfs write +performance was shown to be 3.8x faster. The test was done to a ramdisk +to mitigate any I/O induced variability. + +Quick start +----------- + +- Modify CRC and OPTIONS in the Makefile. There are examples for the two most + common crc32s. + +- Type make to create the constants (crc32_constants.h) + +- Import the code into your application (crc32.S crc32_wrapper.c + crc32_constants.h ppc-opcode.h) and call the CRC: + +``` +unsigned int crc32_vpmsum(unsigned int crc, unsigned char *p, unsigned long len); +``` + +CRC background +-------------- + +For a good background on CRCs, check out: + +http://www.ross.net/crc/download/crc_v3.txt + +A few key points: + +- A CRC is the remainder after dividing a message by the CRC polynomial, + ie M mod CRC_POLY +- multiply/divide is carryless +- add/subtract is an xor +- n (where n is the order of the CRC) bits of zeroes are appended to the + end of the message. + +One more important piece of information - a CRC is a linear function, so: + +``` + CRC(A xor B) = CRC(A) xor CRC(B) + + CRC(A . B) = CRC(A) . CRC(B) (remember this is carryless multiply) +``` + +If we take 64bits of data, represented by two 32 bit chunks (AAAAAAAA +and BBBBBBBB): + +``` +CRC(AAAAAAAABBBBBBBB) + = CRC(AAAAAAAA00000000 xor BBBBBBBB) + = CRC(AAAAAAAA00000000) xor CRC(BBBBBBBB) +``` + +If we operate on AAAAAAAA: + +``` +CRC(AAAAAAAA00000000) + = CRC(AAAAAAAA . 100000000) + = CRC(AAAAAAAA) . CRC(100000000) +``` + +And CRC(100000000) is a constant which we can pre-calculate: + +``` +CRC(100000000) + = 100000000 mod CRC_POLY + = 2^32 mod CRC_POLY +``` + +Finally we can add our modified AAAAAAAA to BBBBBBBB: + +``` +CRC(AAAAAAAABBBBBBBB) + = ((2^32 mod CRC_POLY) . CRC(AAAAAAAA)) xor CRC(BBBBBBBB) +``` + +In other words, with the right constants pre-calculated we can shift the +input data around and we can also calculate the CRC in as many parallel +chunks as we want. + +No matter how much shifting we do, the final result will be be 64 bits of +data (63 actually, because there is no carry into the top bit). To reduce +it further we need a another trick, and that is Barrett reduction: + +http://en.wikipedia.org/wiki/Barrett_reduction + +Barrett reduction is a method of calculating a mod n. The idea is to +calculate q, the multiple of our polynomial that we need to subtract. By +doing the computation 2x bits higher (ie 64 bits) and shifting the +result back down 2x bits, we round down to the nearest multiple. + +``` + k = 32 + m = floor((4^k)/n) = floor((4^32))/n) + n = 64 bits of data + a = 32 bit CRC + + q = floor(ma/(2^64)) + result = a - qn +``` + +An example in the floating point domain makes it clearer how this works: + +``` +a mod n = a - floor(am) * n +``` + +Let's use it to calculate 22 mod 10: + +``` + a = 22 + n = 10 + m = 1/n = 1/10 = 0.1 + +22 mod 10 + = 22 - floor(22*0.1) * 10 + = 22 - 2 * 10 + = 22 - 20 + = 2 +``` + +There is one more issue left - bit reflection. Some CRCs are defined to +operate on the least significant bit first (eg CRC32c). Lets look at +how this would get laid out in a register, and lets simplify it to just +two bytes (vs a 16 byte VMX register): + + [ 8..15 ] [ 0..7 ] + +Notice how the bits and bytes are out of order. Since we are doing +multi word multiplication on these values we need them to both be +in order. + +The simplest way to fix this is to reflect the bits in each byte: + + [ 15..8 ] [ 7..0 ] + +However shuffling bits in a byte is expensive on most CPUs. It is +however relatively cheap to shuffle bytes around. What if we load +the bytes in reversed: + + [ 0..7 ] [ 8..15 ] + +Now the bits and bytes are in order, except the least significant bit +of the register is now on the left and the most significant bit is on the +right. We operate as if the register is reflected, which normally we +cannot do. The reason we get away with this is our multiplies are carryless +and our addition and subtraction is xor, so our operations never create +carries. + +The only trick is we have to shift the result of multiplies left one +because the high bit of the multiply is always 0, and we want that high bit +on the right not the left. + +Implementation +-------------- + +The vpmsum instructions on POWER8 have a 6 cycle latency and we can +execute one every cycle. In light of this the main loop has 8 parallel +streams which consume 8 x 16 B each iteration. At the completion of this +loop we have taken 32 kB of data and reduced it to 8 x 16 B (128 B). + +The next step is to take this 128 B and reduce it to 8 B. At this stage +we also add 32 bits of 0 to the end. + +We then apply Barrett reduction to get our CRC. + +Examples +-------- +- barrett_reduction: An example of Barrett reduction + +- final_fold: Starting with 128 bits, add 32 bits of zeros and reduce it to + 64 bits, then apply Barrett reduction + +- final_fold2: A second method of reduction + +Acknowledgements +---------------- + +Thanks to Michael Gschwind, Jeff Derby, Lorena Pesantez and Stewart Smith +for their ideas and assistance. |