Commit message log
https://github.com/apache/couchdb-erlfdb/releases/tag/v1.2.2
In an overload scenario, do not let notifiers crash and lose their subscribers;
instead, make them more robust and let them retry on future or transaction
timeouts.
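The retry behavior can be sketched as follows. Python is used purely to illustrate the Erlang logic; the exception classes and the `notify_subscribers` helper are hypothetical stand-ins for erlfdb future and transaction timeout errors, not real APIs.

```python
# Sketch of the retry-instead-of-crash behavior described above.
# FutureTimeout/TransactionTimeout are stand-ins for erlfdb future
# and transaction timeout errors.
class FutureTimeout(Exception): pass
class TransactionTimeout(Exception): pass

def notify_subscribers(fetch_update, deliver, max_retries=5):
    """Keep the notifier alive under overload: retry on timeouts instead of
    crashing and losing the subscriber list."""
    for _attempt in range(max_retries):
        try:
            deliver(fetch_update())
            return True
        except (FutureTimeout, TransactionTimeout):
            continue  # transient overload; try again
    return False  # give up on this update, but the notifier itself survives
```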
Optimize couch_views by using a separate set of acceptors and workers.
Previously, all `max_workers` were spawned on startup and waited to
accept jobs in parallel. In a setup with a large number of pods, and
100 workers per pod, that could generate a lot of conflicts when all
those workers race to accept the same job at the same time.
The improvement is to spawn only a limited number of acceptors (5 by
default), then spawn more after some of them become workers. Also,
when some workers finish or die with an error, check whether more
acceptors can be spawned.
As an example, here is what might happen with `max_acceptors = 5` and
`max_workers = 100` (`A` and `W` are the current counts of acceptors
and workers, respectively):
1. Starting out:
`A = 5, W = 0`
2. After 2 acceptors start running:
`A = 3, W = 2`
Then immediately 2 more acceptors are spawned:
`A = 5, W = 2`
3. After 95 workers have started:
`A = 5, W = 95`
4. Now if 3 acceptors accept, it would look like:
`A = 2, W = 98`
But no more acceptors would be started.
5. If the last 2 acceptors also accept jobs: `A = 0, W = 100`. At this
point no more indexing jobs can be accepted and started until at
least one of the workers finishes and exits.
6. If 1 worker exits:
`A = 0, W = 99`
An acceptor is immediately spawned:
`A = 1, W = 99`
7. If all 99 workers exit, it goes back to:
`A = 5, W = 0`
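The accounting in the steps above can be modeled minimally as follows. Python is used for illustration only; the `Pool` class and its method names are hypothetical, not the couch_views API.

```python
# Illustrative model of the acceptor/worker accounting described above.
class Pool:
    def __init__(self, max_acceptors=5, max_workers=100):
        self.max_acceptors = max_acceptors
        self.max_workers = max_workers
        self.acceptors = 0
        self.workers = 0
        self.spawn_acceptors()

    def spawn_acceptors(self):
        # Spawn acceptors only while the combined total stays under max_workers.
        while (self.acceptors < self.max_acceptors
               and self.acceptors + self.workers < self.max_workers):
            self.acceptors += 1

    def accept(self):
        # An acceptor picks up a job and becomes a worker.
        if self.acceptors == 0:
            return False
        self.acceptors -= 1
        self.workers += 1
        self.spawn_acceptors()
        return True

    def worker_exit(self):
        # A worker finishes (or dies); check if more acceptors can start.
        self.workers -= 1
        self.spawn_acceptors()
```

Running the worked example through this model reproduces each step: 95 accepts leave `A = 5, W = 95`, five more drain the acceptors to `A = 0, W = 100`, and worker exits refill acceptors one at a time.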
As per the ML
[discussion](https://lists.apache.org/thread.html/rb328513fb932e231cf8793f92dd1cc2269044cb73cb43a6662c464a1%40%3Cdev.couchdb.apache.org%3E),
add a `uuid` field to db info results in order to be able to uniquely identify
a particular instance of a database. When a database is deleted and re-created
with the same name, it will return a new `uuid` value.
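A client could use the new field to detect a delete/re-create cycle. A tiny sketch follows; the `get_db_info` callable is a hypothetical stand-in for fetching db info over HTTP.

```python
# Hypothetical client-side check using the new `uuid` field in db info.
def same_db_instance(saved_uuid, get_db_info):
    """Return True if the database is still the same instance the caller
    previously observed, False if it was deleted and re-created."""
    return get_db_info()["uuid"] == saved_uuid
```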
When waiting to accept jobs with scheduling enabled, the timeout is limited
based on the scheduled-time parameter. When the `no_schedule` option is used,
the scheduled-time parameter is always set to 0, so in that case we have to
special-case the limit to return `infinity`.
Later on, when we wait for the watch to fire, the actual timeout can still be
limited by a separate user-specified timeout option; but if the user specifies
`infinity` there and sets `#{no_schedule => true}`, then we should respect that
and never return `{error, not_found}` in response.
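The special-casing can be modeled as below. This is a sketch only: Python's `float("inf")` stands in for Erlang's `infinity` atom, and both function names are illustrative, not the couch_jobs API.

```python
# Model of the timeout-limit logic described above.
INFINITY = float("inf")

def accept_timeout_limit(scheduled_time, no_schedule):
    # With no_schedule, scheduled_time is always 0, so a time-based limit
    # is meaningless; return infinity instead of limiting the wait.
    if no_schedule:
        return INFINITY
    return scheduled_time

def effective_timeout(user_timeout, scheduled_time, no_schedule):
    # The watch wait is still bounded by the user-specified timeout, unless
    # the user passes infinity together with no_schedule => true.
    return min(user_timeout, accept_timeout_limit(scheduled_time, no_schedule))
```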
Use the `no_schedule` option to speed up job dequeuing. This optimization
allows dequeuing jobs more efficiently if these conditions are met:
1) Job IDs start with a random prefix
2) No time-based scheduling is used
Both of those can be true for views: job IDs can be generated such that the
signature comes before the db name part, which is what this commit does.
The way the optimization works is that a random ID is generated in the pending
jobs range, then a key selector is used to pick a job either before or after
it. That reduces each dequeue attempt to just 1 read instead of reading up to
1000 jobs.
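One way to picture the key-selection trick in miniature (a sketch, not the actual couch_jobs code; Python's `bisect` over a sorted list stands in for FDB key selectors over the pending range):

```python
# Sketch: instead of scanning up to 1000 pending jobs, probe a random key in
# the pending range and grab the nearest job at-or-after it (wrapping to the
# one before it at the end of the range).
import bisect
import random

def dequeue_one(pending_ids):
    """pending_ids: sorted list of job IDs that start with random prefixes."""
    if not pending_ids:
        return None
    probe = format(random.getrandbits(32), "08x")  # random point in key space
    i = bisect.bisect_left(pending_ids, probe)
    # One "read": the job just after the probe, else the one just before it.
    return pending_ids.pop(i if i < len(pending_ids) else i - 1)
```

Because IDs are uniformly random, the probe lands anywhere in the pending range with equal probability, so concurrent dequeuers rarely pick the same job.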
Under load, the accept loop can blow up with a timeout error from
`erlfdb:wait(...)` (https://github.com/apache/couchdb-erlfdb/blob/master/src/erlfdb.erl#L255),
so guard against it just like we do for FDB transaction timeout (1031) errors.
Update db handles right away as soon as the db version is checked. This ensures
concurrent requests get access to the current handle as soon as possible, and
may avoid doing extra version checks and re-opens.
Check metadata versions to ensure newer handles are not clobbered. The same
thing is done for removal: `maybe_remove/1` removes a handle only if there
isn't a newer handle already there.
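The version check can be modeled like this. An illustrative sketch only: the cache is a plain dict, `maybe_update` is a hypothetical name, and only the `maybe_remove` semantics are taken from the description above.

```python
# Model of a version-checked handle cache: an update only replaces the cached
# handle if its metadata version is not older, and removal only evicts when
# no newer handle has been stored in the meantime.
cache = {}

def maybe_update(db, handle, version):
    cur = cache.get(db)
    if cur is None or cur[1] <= version:
        cache[db] = (handle, version)
        return True
    return False  # a newer handle is already cached; don't clobber it

def maybe_remove(db, version):
    cur = cache.get(db)
    if cur is not None and cur[1] <= version:
        del cache[db]
        return True
    return False  # a newer handle appeared; leave it in place
```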
Also guard against too many conflicts during overload.
Let users specify the maximum document count for _bulk_docs requests. If the
document count exceeds the maximum, a 413 HTTP error is returned. This also
signals the replicator to try to bisect the _bulk_docs array into smaller
batches.
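The 413-plus-bisect interplay might look like this in miniature. Both functions are illustrative sketches (assuming `max_count >= 1`), not the CouchDB or replicator API.

```python
# Sketch: the server rejects oversized _bulk_docs requests with a 413, and a
# replicator-style client reacts by bisecting the batch and retrying.
def bulk_docs(docs, max_count):
    if len(docs) > max_count:
        return 413, None           # request entity too large
    return 201, list(docs)         # accepted

def replicate(docs, max_count):
    status, saved = bulk_docs(docs, max_count)
    if status != 413:
        return saved
    mid = len(docs) // 2           # bisect and retry each half
    return replicate(docs[:mid], max_count) + replicate(docs[mid:], max_count)
```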
Observed a number of timeouts with the previous default
In a constrained CI environment transactions could retry multiple times so we
cannot rely on precisely counting erlfdb:transactional/2 calls.
Increase the couch_views job timeout by 20 seconds. This sets a larger jitter
when multiple nodes concurrently check and re-enqueue jobs, reducing the
chance of them bumping into each other and conflicting.
If they do conflict in the activity monitor, catch the error and emit an error
log. The longer timeout also gains some robustness under load when jobs whose
workers have suddenly died are re-enqueued.
Fix handling of limit query parameter
Improve log of permanently deleting databases
* Interactive (regular) requests are split into smaller transactions, so
larger updates won't fail with either timeout or transaction-too-large
FDB errors.
* Non-interactive (replicated) requests can now batch their updates in a few
transactions and gain extra performance.
Batch size is configurable:
```
[fabric]
update_docs_batch_size = 5000000
```
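Assuming the limit is an approximate byte size (as the 5000000 default suggests), the batching could be sketched as below; `doc_size` is a hypothetical stand-in for the encoded document size, and none of this is the actual fabric code.

```python
# Illustrative size-based batching of replicated doc updates: accumulate docs
# until adding the next one would push the batch over the configured limit,
# then flush so each transaction stays under it.
def batch_docs(docs, batch_size, doc_size=len):
    batches, cur, cur_size = [], [], 0
    for doc in docs:
        size = doc_size(doc)
        if cur and cur_size + size > batch_size:
            batches.append(cur)          # flush the current transaction batch
            cur, cur_size = [], 0
        cur.append(doc)
        cur_size += size
    if cur:
        batches.append(cur)
    return batches
```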
Sometimes this test fails on Jenkins but doesn't fail locally. The attempted
fix is to simply retry a few times until the number of children in the
supervisor reaches the expected value. Also extend the timeout to 15 seconds.
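The retry shape this fix relies on can be sketched as a generic wait-until helper. Illustrative only, not the actual test_util API.

```python
# Poll a condition until it holds or a timeout elapses, instead of asserting
# immediately; this absorbs CI timing variability.
import time

def wait_until(condition, timeout=15.0, interval=0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```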
Pagination API
Add some longer timeouts and fix a race condition in db cleanup tests
(Thanks to @jdoane for the patch)
Background database deletion
Allow a background job to delete soft-deleted databases according to
specified criteria in order to release space. Once a database is hard-deleted,
its data can't be fetched back.
Co-authored-by: Nick Vatamaniuc <vatamane@apache.org>
Previously we always returned `false` because the result from
`couch_jobs:get_job_state` was expected to be just `Status`, but it is `{ok,
Status}`. That part is now explicit, so we account for every possible job
state and fail on a clause match if we get anything else there.
Moved the `job_state/2` function to `couch_view_jobs` to avoid duplicating the
logic for calculating the job_id and to keep it all in one module.
Tests were updated to explicitly check for each job state.
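The corrected pattern can be modeled as below. A sketch only: tuples stand in for Erlang terms, the explicit failure mirrors the clause-match behavior described above, and the set of state names is an assumption.

```python
# Unpack the {ok, Status} result explicitly and fail loudly on anything
# unexpected, rather than comparing the raw tuple to a bare status value
# (which always compared unequal and silently returned false).
KNOWN_STATES = {"pending", "running", "finished"}

def is_job_finished(get_job_state_result):
    tag, status = get_job_state_result  # raises if the shape is unexpected
    if tag != "ok" or status not in KNOWN_STATES:
        raise ValueError(f"unexpected job state result: {get_job_state_result!r}")
    return status == "finished"
```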
Re-enable ExUnit tests
On CI, creating 100 dbs in a row was too much to do in 5 seconds, so bump the
timeout to 15.
In its current incarnation, the so-called "simple lifecycle" test is
prone to numerous failures in the CI system [1], doubtless because it's
riddled with race conditions. The original author makes many assumptions
about how quickly an (actual, unmocked) FDB instance will respond to a
request.
The primary goal is to stop failing CI builds, while other
considerations include: keeping the run time of the test as low as
possible, keeping the code coverage high, and documenting the known
races.
Specifically:
- Increase the `stale` and `expired` times by a factor of 5 to decrease
sensitivity to poor FDB performance.
- Change default timer from `erlang:system_time/1` to `os:timestamp` on
the assumption that the latter is less prone to warping [2].
- Decrease the period of the cache server reaper by half to increase
accuracy of eviction time.
- Inline and modify the `test_util:wait` code to make the timer
explicit, and emphasize that `timer:delay/1` only works with millisecond
resolution.
- Don't fail the test if it can't get a fresh lookup immediately after
insertion, but let it continue on to the next race, at least to the
point of expiration and deletion, which continue to be asserted.
- Factor `Timeout` and `Interval` to allow declarations near the other
hard-coded parameters.
- Move cache server `Opts` into `setup/0` and eliminate `start_link/0`.
- Double the overall test timeout to 20 seconds.
This has soaked for hundreds of runs on a 5 year old laptop, but the
real test is the CI system.
Should this test continue to fail CI builds, additional improvements
could include mocking the timer and/or FDB layer to eliminate the
variability of an integrated system.
[1] https://ci-couchdb.apache.org/blue/organizations/jenkins/jenkins-cm1%2FPullRequests/detail/PR-2813/10/pipeline
[2] http://erlang.org/doc/apps/erts/time_correction.html#terminology
Add an extra level of error checking to erlfdb:set_option, since it could fail
if we forget to update the erlfdb dependency or if the FDB server version is
too old. That operation can fail with error:badarg, which is exactly how
list_to_integer fails, resulting in a confusing log message.