src/checksum/power8/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208

crc32-vpmsum
============

A set of examples for accelerating CRC32 calculations using the vector
polynomial multiply sum (vpmsum) instructions introduced in POWER8. These
instructions implement byte, halfword, word and doubleword carryless
multiply/add.

Performance
-----------

An implementation of slice-by-8, one of the fastest lookup table methods
is included so we can compare performance against it. Testing 5000000
iterations of a CRC of 32 kB of data (to keep it L1 cache contained):

```
# time slice_by_8_bench 32768 5000000
122.220 seconds

# time crc32_bench 32768 5000000
2.937 seconds
```

The vpmsum accelerated CRC is just over 41x faster.

This test was run on a 4.1 GHz POWER8, so the algorithm sustains about
52 GiB/sec or 13.6 bytes/cycle. The theoretical limit is 16 bytes/cycle
since we can execute a maximum of one vpmsum instruction per cycle.

In another test, a version was added to the kernel and btrfs write
performance was shown to be 3.8x faster. The test was done to a ramdisk
to mitigate any I/O induced variability.

Quick start
-----------

- Modify CRC and OPTIONS in the Makefile. There are examples for the two most
  common crc32s.

- Type make to create the constants (crc32_constants.h)

- Import the code into your application (crc32.S crc32_wrapper.c
  crc32_constants.h ppc-opcode.h) and call the CRC:

```
unsigned int crc32_vpmsum(unsigned int crc, unsigned char *p, unsigned long len);
```

CRC background
--------------

For a good background on CRCs, check out:

http://www.ross.net/crc/download/crc_v3.txt

A few key points:

- A CRC is the remainder after dividing a message by the CRC polynomial,
  ie M mod CRC_POLY
- multiply/divide is carryless
- add/subtract is an xor
- n (where n is the order of the CRC) bits of zeroes are appended to the
  end of the message.

One more important piece of information - a CRC is a linear function, so:

```
	CRC(A xor B) = CRC(A) xor CRC(B)

	CRC(A . B) = CRC(A) . CRC(B)	(remember this is carryless multiply)
```

If we take 64bits of data, represented by two 32 bit chunks (AAAAAAAA
and BBBBBBBB):

```
CRC(AAAAAAAABBBBBBBB)
	= CRC(AAAAAAAA00000000 xor BBBBBBBB)
	= CRC(AAAAAAAA00000000) xor CRC(BBBBBBBB)
```

If we operate on AAAAAAAA:

```
CRC(AAAAAAAA00000000)
	= CRC(AAAAAAAA . 100000000)
	= CRC(AAAAAAAA) . CRC(100000000)
```

And CRC(100000000) is a constant which we can pre-calculate:

```
CRC(100000000)
	= 100000000 mod CRC_POLY
	= 2^32 mod CRC_POLY
```

Finally we can add our modified AAAAAAAA to BBBBBBBB:

```
CRC(AAAAAAAABBBBBBBB)
	= ((2^32 mod CRC_POLY) . CRC(AAAAAAAA)) xor CRC(BBBBBBBB)
```

In other words, with the right constants pre-calculated we can shift the
input data around and we can also calculate the CRC in as many parallel
chunks as we want.

No matter how much shifting we do, the final result will be be 64 bits of
data (63 actually, because there is no carry into the top bit). To reduce
it further we need a another trick, and that is Barrett reduction:

http://en.wikipedia.org/wiki/Barrett_reduction

Barrett reduction is a method of calculating a mod n. The idea is to
calculate q, the multiple of our polynomial that we need to subtract. By
doing the computation 2x bits higher (ie 64 bits) and shifting the
result back down 2x bits, we round down to the nearest multiple.

```
	k = 32
	m = floor((4^k)/n) = floor((4^32))/n)
	n = 64 bits of data
	a = 32 bit CRC

	q = floor(ma/(2^64))
	result = a - qn
```

An example in the floating point domain makes it clearer how this works:

```
a mod n = a - floor(am) * n
```

Let's use it to calculate 22 mod 10:

```
	a = 22
	n = 10
	m = 1/n = 1/10 = 0.1

22 mod 10
	= 22 - floor(22*0.1) * 10
	= 22 - 2 * 10
	= 22 - 20
	= 2
```

There is one more issue left - bit reflection. Some CRCs are defined to
operate on the least significant bit first (eg CRC32c). Lets look at
how this would get laid out in a register, and lets simplify it to just
two bytes (vs a 16 byte VMX register):

    [ 8..15 ] [ 0..7 ]

Notice how the bits and bytes are out of order. Since we are doing
multi word multiplication on these values we need them to both be
in order.

The simplest way to fix this is to reflect the bits in each byte:

    [ 15..8 ] [ 7..0 ]

However shuffling bits in a byte is expensive on most CPUs. It is
however relatively cheap to shuffle bytes around. What if we load
the bytes in reversed:

    [ 0..7 ] [ 8..15 ]

Now the bits and bytes are in order, except the least significant bit
of the register is now on the left and the most significant bit is on the
right. We operate as if the register is reflected, which normally we
cannot do. The reason we get away with this is our multiplies are carryless
and our addition and subtraction is xor, so our operations never create
carries.

The only trick is we have to shift the result of multiplies left one
because the high bit of the multiply is always 0, and we want that high bit
on the right not the left.

Implementation
--------------

The vpmsum instructions on POWER8 have a 6 cycle latency and we can
execute one every cycle. In light of this the main loop has 8 parallel
streams which consume 8 x 16 B each iteration. At the completion of this
loop we have taken 32 kB of data and reduced it to 8 x 16 B (128 B).

The next step is to take this 128 B and reduce it to 8 B. At this stage
we also add 32 bits of 0 to the end.

We then apply Barrett reduction to get our CRC.

Examples
--------
- barrett_reduction: An example of Barrett reduction

- final_fold: Starting with 128 bits, add 32 bits of zeros and reduce it to
  64 bits, then apply Barrett reduction

- final_fold2: A second method of reduction

Acknowledgements
----------------

Thanks to Michael Gschwind, Jeff Derby, Lorena Pesantez and Stewart Smith
for their ideas and assistance.