author | Nick Vatamaniuc <vatamane@gmail.com> | 2022-10-20 23:17:57 -0400 |
---|---|---|
committer | Nick Vatamaniuc <nickva@users.noreply.github.com> | 2022-10-25 23:06:49 -0400 |
commit | 4e9c5588d765b84742784de5aafa146d14eed11f (patch) | |
tree | 1f57da112f1cf1a7384166983d9c282480dc9470 /rel | |
parent | 63d71bec8ea9b59660645830ff822deef7d698c1 (diff) | |
download | couchdb-4e9c5588d765b84742784de5aafa146d14eed11f.tar.gz |
Optimize _bulk_get endpoint
Use new `fabric:open_revs/3` API implemented in #4201 to optimize the _bulk_get
HTTP API. Since `open_revs/3` itself is new, allow reverting to individual doc
fetches using the previous `open_revs/4` API via a config setting, mostly as a
precautionary measure.
The implementation consists of three main parts:
* Parse and validate args
* Fetch the docs using `open_revs/3` or `open_revs/4`
* Emit results as json or multipart, based on the `Accept` header value
Parsing and validation checks for various errors and then returns a map of
`#{Ref => {DocId, RevOrError, DocOptions}}` and a list of Refs in the original
argument order. The middle tuple element, `RevOrError`, is notable in that it
may hold either the revision ID (`[Rev]` or `all`) or `{error, {Rev, ErrorTag,
ErrorReason}}`.
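As a rough illustration of that data shape, here is a small Python model. The
real code is Erlang; the function name `parse_bulk_get_args`, the integer refs,
and the error tags and messages below are assumptions made up for this sketch,
not the names used in `chttpd_db.erl`.

```python
import itertools

# Stand-in for Erlang's make_ref(): any unique, hashable token works here.
_ref_counter = itertools.count()


def parse_bulk_get_args(docs, doc_options=()):
    """Model of the parse step: return (arg_map, refs_in_order).

    arg_map maps a unique ref to (doc_id, rev_or_error, doc_options), where
    rev_or_error is [rev], "all", or ("error", (rev, tag, reason)).
    """
    arg_map = {}
    refs = []
    for item in docs:
        ref = next(_ref_counter)
        doc_id = item.get("id")
        rev = item.get("rev")
        if not isinstance(doc_id, str) or not doc_id:
            # Hypothetical error tag and reason, for illustration only.
            rev_or_error = ("error", (rev, "illegal_docid", "Invalid doc ID"))
        elif rev is None:
            rev_or_error = "all"
        elif isinstance(rev, str) and "-" in rev:
            rev_or_error = [rev]
        else:
            rev_or_error = ("error", (rev, "bad_request", "Invalid rev format"))
        arg_map[ref] = (doc_id, rev_or_error, tuple(doc_options))
        refs.append(ref)  # remembers the original argument order
    return arg_map, refs
```

Keeping invalid arguments in the map, rather than dropping them, is what lets a
later step emit an error entry at the right position in the response.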
Fetching the docs is fairly straightforward. The slightly interesting aspect is
that when an error is returned from `open_revs/3` we have to pretend that all
the batched docs failed with that error. That is done to preserve the "zip"
property, where every input argument has its matching result at the same
position in the results list. Another notable thing here is that we fixed a bug
where the error returned from `fabric:open_revs/3,4` was not formatted in a way
that could be emitted as json, resulting in a function clause error. That is why
we call `couch_util:to_binary/1` on it. This was detected by the integration
tests outlined below and was missed by the previous mocked unit test.
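The "zip" property and the batch/single switch can be modelled in a few lines
of Python. This is only a sketch under stated assumptions: `fetch_docs`, the
`("ok", ...)` / `("error", ...)` return convention, and the two callables
standing in for the batched and per-doc fetches are all hypothetical, and
`str()` merely plays the role that `couch_util:to_binary/1` plays in the Erlang
code.

```python
def _is_error(value):
    """True for the ("error", details) marker used in this sketch."""
    return isinstance(value, tuple) and len(value) == 2 and value[0] == "error"


def fetch_docs(arg_map, refs, open_revs_batch, open_revs_single, use_batches=True):
    """Return exactly one result per ref, in the same order as refs."""
    # Arguments that already failed validation are never fetched.
    to_fetch = [ref for ref in refs if not _is_error(arg_map[ref][1])]
    fetched = {}
    if use_batches:
        status, payload = open_revs_batch([arg_map[ref] for ref in to_fetch])
        if status == "ok":
            fetched = dict(zip(to_fetch, payload))
        else:
            # The whole batch failed: report the same error for every doc in
            # it, converted to something that can be serialized as JSON.
            reason = str(payload)
            fetched = {ref: ("error", reason) for ref in to_fetch}
    else:
        for ref in to_fetch:
            status, payload = open_revs_single(arg_map[ref])
            fetched[ref] = payload if status == "ok" else ("error", str(payload))
    # Validation errors fall through to the value stored in arg_map.
    return [fetched.get(ref, arg_map[ref][1]) for ref in refs]
```

Because the result list is always built by walking `refs`, its length and order
match the input even when the fetch itself fails.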
The last part is emitting the results as either json or multipart. Here most
changes are cleanups and grouping into separate handler functions. The `Accept`
header can be either `multipart/related` or `multipart/mixed`, and we try to
emit the same content type that was passed in the `Accept` header. One notable
thing here is that by DRY-ing the filtering of attachments in
`non_stubbed_attachments/1` we fixed another bug where the multipart result
returned nonsense when all attachments were stubs: the doc was returned as a
multipart chunk with content type `multipart/...` instead of
`application/json`. This was also detected by the integration tests described
below.
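The two decisions in this part, which response type to use and whether a doc
still needs a multipart chunk, can be sketched in Python. Everything here is a
simplified model: representing attachments as dicts with a `stub` flag, the
helper names other than `non_stubbed_attachments`, and the naive substring
match on `Accept` are assumptions for illustration, not the Erlang
implementation.

```python
def non_stubbed_attachments(attachments):
    """Keep only attachments that carry data (modelled by a "stub" flag)."""
    return [att for att in attachments if not att.get("stub", False)]


def pick_multipart_type(accept_header):
    """Return the multipart type named in Accept, or None to emit JSON."""
    accept = accept_header.lower()
    for candidate in ("multipart/related", "multipart/mixed"):
        if candidate in accept:
            return candidate
    return None


def doc_is_plain_json(attachments):
    """A doc whose attachments are all stubs is emitted as application/json."""
    return not non_stubbed_attachments(attachments)
```

Routing both the "has attachments" check and the emitted attachment list
through the same filter is what prevents the all-stubs case from being treated
as a doc with attachment bodies.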
The largest changes are in the testing area. Previous multipart tests were
using mocks heavily, were quite fragile, and didn't have good coverage. Those
tests were removed and replaced by new end-to-end tests in
`chttpd_bulk_get_test.erl`. To make that happen, add a simple multipart parser
utility function which knows how to parse multipart responses into maps. Those
maps preserve chunk headers and we can match those with `?assertMatch(...)`
fairly easily. The tests try to get decent coverage for the `chttpd_db.erl`
bulk_get implementation and its utility functions, but they are also end-to-end
tests, so they exercise everything below, including the fabric and couch layers.
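To give an idea of what such a parser does, here is a minimal Python sketch
that splits a multipart body into a list of maps. It is not the Erlang test
utility from this commit: the function name, the output keys, and the
simplifications (no nested multipart, every chunk has at least one header, the
boundary never appears inside a payload) are assumptions of this example.

```python
def parse_multipart(body, boundary):
    """Split a multipart body into [{"headers": {...}, "body": str}, ...]."""
    delimiter = "--" + boundary
    chunks = []
    # Everything before the first delimiter is preamble and is skipped.
    for part in body.split(delimiter)[1:]:
        if part.startswith("--"):
            break  # closing delimiter "--boundary--"
        # Drop the CRLF after the delimiter and the one before the next.
        if part.startswith("\r\n"):
            part = part[2:]
        if part.endswith("\r\n"):
            part = part[:-2]
        head, _, payload = part.partition("\r\n\r\n")
        headers = {}
        for line in head.split("\r\n"):
            name, _, value = line.partition(":")
            if name:
                headers[name.strip().lower()] = value.strip()
        chunks.append({"headers": headers, "body": payload})
    return chunks
```

With chunk headers preserved as a map, a test can assert on both the content
type of each chunk and its decoded payload.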
Quick single-node testing with couchdyno, replicating 1 million docs, shows at
least a 2x speedup in completing the replication with this PR.
On main:
```
r=rep.Rep(); r.replicate_1_to_n_and_compare(1, num=1000000, normal=True)
330 sec
```
With this PR:
```
r=rep.Rep(); r.replicate_1_to_n_and_compare(1, num=1000000, normal=True)
160 sec
```
Individual `_bulk_get` response times show an even larger improvement, roughly
an 8x speedup:
On main:
```
[notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 468
[notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 479
```
With this PR:
```
[notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 54
[notice] ... POST /cdyno-0000001/_bulk_get?latest=true&revs=true&attachments=false 200 ok 61
```
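The speedup figures follow directly from the numbers quoted above. A quick
check in Python, using only the values from the logs and timings in this
message (the helper name `speedup` is just for this example):

```python
def speedup(before, after):
    """Ratio of the old duration to the new one."""
    return before / after


# Full replication of 1 million docs: 330 sec on main vs 160 sec with this PR.
replication_speedup = speedup(330, 160)

# Two individual _bulk_get requests, paired in the order they are listed.
bulk_get_speedups = [speedup(468, 54), speedup(479, 61)]

print(f"replication: {replication_speedup:.2f}x")
print("bulk_get:", ", ".join(f"{s:.2f}x" for s in bulk_get_speedups))
```

This gives roughly 2.06x for the replication and between about 7.9x and 8.7x
for the individual requests, consistent with the "2x" and "8x" summaries.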
Fixes: https://github.com/apache/couchdb/issues/4183
Diffstat (limited to 'rel')
-rw-r--r-- | rel/overlay/etc/default.ini | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/rel/overlay/etc/default.ini b/rel/overlay/etc/default.ini
index 1b1f6111d..6bb2ef475 100644
--- a/rel/overlay/etc/default.ini
+++ b/rel/overlay/etc/default.ini
@@ -182,6 +182,12 @@ bind_address = 127.0.0.1
 ; Set to true to decode + to space in db and doc_id parts.
 ; decode_plus_to_space = true
 
+; Set to false to revert to a previous _bulk_get implementation using single
+; doc fetches internally. Using batches should be faster, however there may be
+; bugs in the new new implemention, so expose this option to allow reverting to
+; the old behavior.
+;bulk_get_use_batches = true
+
 ;[jwt_auth]
 ; List of claims to validate
 ; can be the name of a claim like "exp" or a tuple if the claim requires