summaryrefslogtreecommitdiff
path: root/doc/specs/thrift-compact-protocol.md
blob: 07cc5626ebedecf55263cc6126b577a6e6d63b27 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
Thrift Compact protocol encoding 
================================

<!--
--------------------------------------------------------------------

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.

--------------------------------------------------------------------
-->

This documents describes the wire encoding for RPC using the Thrift *compact protocol*.

The information here is _mostly_ based on the Java implementation in the Apache thrift library (version 0.9.1) and
[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementation however,
should behave the same.

For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache.org/static/files/thrift-20070401.pdf).

# Contents

* Compact protocol
  * Base types
  * Message
  * Struct
  * List and Set
  * Map
* BNF notation used in this document

# Compact protocol

## Base types

### Integer encoding

The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, and the _var int_.

Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
-2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:

```scala
def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
```

The zigzag int is then encoded as a *var int*, also known as *Unsigned LEB128*.  Var ints take 1 to 5 bytes (int32) or 
1 to 10 bytes (int64). The process consists in taking a Big Endian unsigned integer, left-padding the bit-string to 
make it a multiple of 7 bits, splitting it into 7-bit groups, prefixing the most-significant 7-bit group with the 0 
bit, prefixing the remaining 7-bit groups with the 1 bit and encoding the resulting bit-string in Little Endian.

For example, the integer 50399 is encoded as follows:

```
50399 = 1100 0100 1101 1111         (Big Endian representation)
      = 00000 1100 0100 1101 1111   (Left-padding)
      = 0000011 0001001 1011111     (7-bit groups)
      = 00000011 10001001 11011111  (Most-significant bit prefixes)
      = 11011111 10001001 00000011  (Little Endian representation)
      = 0xDF 0x89 0x03
```

Var ints are sometimes used directly inside the compact protocol to represent positive numbers.

To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
uses a single byte as in the binary protocol.

### Enum encoding

The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.

### Binary encoding

Binary is sent as follows:

```
Binary protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
| byte length         | bytes               |
+--------+...+--------+--------+...+--------+
```

Where:

* `byte length` is the length of the byte array, using var int encoding (must be >= 0).
* `bytes` are the bytes of the byte array.

### String encoding

*String*s are first encoded to UTF-8, and then send as binary. They do not
include a NUL delimiter.

### Double encoding

Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
layout. Most run-times provide a library to make this conversion. But while the binary protocol encodes the int64 
in 8 bytes in big endian order, the compact protocol encodes it in little endian order - this is due to an early 
implementation bug that finally became the de-facto standard.

### Boolean encoding

Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set,
list or map). Field values are encoded directly in the field header. Element values of type `bool` are sent as an int8;
true as `1` and false as `0`.

## Message

A `Message` on the wire looks as follows:

```
Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv| seq id              | name length         | name                |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
```

Where:

* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82.
* `mmm` is the message type, an unsigned 3 bit integer.
* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`.
* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
* `name` is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

* _Call_: 1
* _Reply_: 2
* _Exception_: 3
* _Oneway_: 4

### Struct

A *Struct* is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and
is followed by the encoded field value. The encoding can be summarized by the following BNF:

```
struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id
```

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any
order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also
possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to
determine how to decode the field value.

Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has
another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking.
Other implementation may perform this check and then either ignore the field, or return a protocol exception.

A *Union* is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An *Exception* is encoded exactly the same as a struct.

### Struct encoding

```
Compact protocol field header (short form) and field value:
+--------+--------+...+--------+
|ddddtttt| field value         |
+--------+--------+...+--------+

Compact protocol field header (1 to 3 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt| field id            | field value         |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+
```

Where:

* `dddd` is the field id delta, an unsigned 4 bits integer, strictly positive.
* `tttt` is field-type id, an unsigned 4 bit integer.
* `field id` the field id, a signed 16 bit integer encoded as zigzag int.
* `field-value` the encoded field value.

The field id delta can be computed by `current-field-id - previous-field-id`, or just `current-field-id` if this is the
first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).

The following field-types can be encoded:

* `BOOLEAN_TRUE`, encoded as `1`
* `BOOLEAN_FALSE`, encoded as `2`
* `I8`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for both structs and union fields, encoded as `12`

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no
length (0 bytes).

## List and Set

List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the
encoded elements.

```
Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt| elements            |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt| size                | elements            |
+--------+--------+...+--------+--------+...+--------+
```

Where:

* `ssss` is the size, 4 bit unsigned int, values `0` - `14`
* `tttt` is the element-type, a 4 bit unsigned int
* `size` is the size, a var int (int32), positive values `15` or higher
* `elements` are the encoded elements

The short form should be used when the length is in the range 0 - 14 (inclusive).

The following element-types are used (see note below):

* `BOOL`, encoded as `2`
* `I8`, encoded as `3`
* `I16`, encoded as `4`
* `I32`, encoded as `5`
* `I64`, encoded as `6`
* `DOUBLE`, encoded as `7`
* `BINARY`, used for binary and string fields, encoded as `8`
* `LIST`, encoded as `9`
* `SET`, encoded as `10`
* `MAP`, encoded as `11`
* `STRUCT`, used for structs and union fields, encoded as `12`

*Note*: Although field-types and element-types lists are currently very similar, there is _no guarantee_ that this will
remain true after new types are added.

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

## Map

Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed
by the encoded elements. The encoding follows this BNF:

```
map           ::= empty-map | non-empty-map
empty-map     ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
```

```
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
| size                |kkkkvvvv| key value pairs     |
+--------+...+--------+--------+--------+...+--------+
```

Where:

* `size` is the size, a var int (int32), strictly positive values
* `kkkk` is the key element-type, a 4 bit unsigned int
* `vvvv` is the value element-type, a 4 bit unsigned int
* `key value pairs` are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and set' section.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value:
2147483647).

# BNF notation used in this document

The following BNF notation is used:

* a plus `+` appended to an item represents repetition; the item is repeated 1 or more times
* a star `*` appended to an item represents optional repetition; the item is repeated 0 or more times
* a pipe `|` between items represents choice, the first matching item is selected
* parenthesis `(` and `)` are used for grouping multiple items