Fix `badarg` when querying replicator's _scheduler/docs endpoint
Jira: COUCHDB-3324
During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
`_dbs`:
```
gen_server:init_it/6
-> mem3_shards:init/1
-> mem3_shards:get_update_seq/0
-> couch_server:create/2
```
and
```
mem3_sync:initial_sync/1
-> mem3_shards:fold/2
-> couch_server:create/2
```
Normally, the first path completes before the second. If the second path
finishes first, the first path fails because it does not expect a
`file_exists` response.
This patch makes `mem3_util:ensure_exists/1` more robust in the face of
a race to create `_dbs`.
Fixes COUCHDB-3402.
Approved by @davisp and @iilyak
Jira: COUCHDB-3324
build: pull authors out of subrepos
- pulls committer names out of subrepos
- uses local variables instead of globals
- makes use of the .mailmap file in our main repo
- composable: just source it and pipe it into your process
Usage in `couchdb-build-release.sh`:
```sh
sed -e "/^#.*/d" THANKS.in > $RELDIR/THANKS
source "build-aux/print-committerlist.sh"
print_comitter_list >> THANKS;
```
This reverts commit 2689507fc0f4a4a3731df34d3634bb1bcd4afbc3.
During mem3 startup, 2 paths attempt to call `couch_server:create/2` on
`_dbs`:
```
gen_server:init_it/6
-> mem3_shards:init/1
-> mem3_shards:get_update_seq/0
-> couch_server:create/2
```
and
```
mem3_sync:initial_sync/1
-> mem3_shards:fold/2
-> couch_server:create/2
```
Normally, the first path completes before the second. If the second path
finishes first, the first path fails because it does not expect a
`file_exists` response.
This patch simply retries `mem3_util:ensure_exists/1` once if it gets back
a `file_exists` response. Any failures past this point are not handled.
Fixes COUCHDB-3402.
Disabling replication startup jitter in Windows makefile
A similar change has already been made to the *nix Makefile.
This is so the JavaScript integration test suite does not time out
when running in Travis. We don't run Windows tests in Travis, but
this should speed things up a bit, and it's nice to keep both in
sync.
Jira: COUCHDB-3324
Scheduling Replicator
The `_scheduler/jobs` endpoint provides a view of all replications managed by
the scheduler. This endpoint includes more information on the replication than
the `_scheduler/docs` endpoint, including the history of state transitions of
the replication. This part was implemented by Benjamin Bastian.
The `_scheduler/docs` endpoint provides a view of all replicator docs which
have been seen by the scheduler. This endpoint includes useful information such
as the state of the replication and the coordinator node. The implementation of
`_scheduler/docs` closely mimics `_all_docs` behavior: similar pagination,
HTTP request processing and fabric / rexi setup. The algorithm is roughly
as follows:
* HTTP endpoint:
  - parses query args like it does for any view query
  - parses the states to filter by; states are kept in the `extra` query arg
* A call is made to couch_replicator_fabric. This is equivalent to
  fabric:all_docs. Here the typical fabric / rexi setup happens.
* The fabric worker is `couch_replicator_fabric_rpc:docs/3`. This worker is
  similar to fabric_rpc's all_docs handler. However, it is a bit more intricate
  in order to handle both replication documents in a terminal state and those
  which are active.
  - Before emitting, it queries the state of the document to see if it is in a
    terminal state. If it is, it filters it and decides whether it should be
    emitted or not.
  - If the document state cannot be determined from the document itself, it
    tries to fetch the active state from the local node's doc processor via a
    key-based lookup. If found, it can also filter based on state and either
    emit or skip.
  - If the document cannot be found in the node's local doc processor ETS
    table, the row is emitted with a doc value of `undecided`. This lets
    the coordinator fetch the state, possibly by querying other nodes'
    doc processors.
* The coordinator then starts handling messages. This also mostly mimics
  all_docs. At this point the most interesting thing is handling `undecided`
  docs. If one is found, then `replicator:active_doc/2` is queried. There, all
  nodes where the document's shards live are queried. This is better than a
  previous implementation where all nodes were queried all the time.
* The final work happens in `couch_replicator_httpd`, where the emitting
  callback is. There, only the doc is emitted (not keys, rows, or values).
  Another thing that happens there is that the `Total` value is decremented to
  account for the always-present _design doc.
Because of this, a bunch of code was removed, including an extra view which
was built and managed by the previous implementation.
As a bonus, other view-related parameters such as skip and limit seem to
work out of the box and don't have to be implemented ad hoc.
Also, most importantly, many thanks to Paul Davis for suggesting this approach.
Jira: COUCHDB-3324
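As a rough illustration, the two endpoints can be queried like any other HTTP
resource. This is a hypothetical sketch: the host, port, credentials and the
exact query parameter names (e.g. `states`) are assumptions, not taken from
this commit.
```sh
# Hypothetical requests; host, port, credentials and parameter names are
# illustrative assumptions.

# All replication jobs currently known to the scheduler:
curl -s 'http://admin:pass@127.0.0.1:5984/_scheduler/jobs'

# Replicator docs, filtered by state and paginated like a view query:
curl -s 'http://admin:pass@127.0.0.1:5984/_scheduler/docs?states=crashing,failed&limit=10&skip=0'
```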
Glue together all the scheduling replicator pieces.
The scheduler is the main component. It can run a large number of replication
jobs by switching between them, stopping and starting some periodically. Jobs
which fail are backed off exponentially. Normal (non-continuous) jobs will be
allowed to run to completion to preserve their current semantics.
Scheduler behavior can be configured by these configuration options in the
`[replicator]` section (an illustrative example is shown after this list):
* `max_jobs` : Number of actively running replications. Making this too high
could cause performance issues. Making it too low could mean replication jobs
might not have enough time to make progress before getting unscheduled again.
This parameter can be adjusted at runtime and will take effect during the next
rescheduling cycle.
* `interval` : Scheduling interval in milliseconds. During each reschedule
cycle the scheduler might start or stop up to "max_churn" jobs.
* `max_churn` : Maximum number of replications to start and stop during
rescheduling. This parameter, along with "interval", defines the rate of job
replacement. During startup, however, a much larger number of jobs could be
started (up to max_jobs) in a short period of time.
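A minimal sketch of what such a configuration might look like, assuming the
usual ini-style CouchDB config file; the file name and the values shown are
illustrative assumptions, not defaults stated by this commit.
```sh
# Hypothetical fragment appended to a node's local.ini; the values below are
# illustrative assumptions, not defaults specified by this commit.
cat >> local.ini <<'EOF'
[replicator]
max_jobs = 100
interval = 60000
max_churn = 20
EOF
```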
Replication jobs are added to the scheduler by the document processor or from
the `couch_replicator:replicate/2` function when called from the `_replicate`
HTTP endpoint handler.
The document processor listens for updates via the couch_multidb_changes
module and then tries to add replication jobs to the scheduler. Sometimes
translating a document update into a replication job can fail, either
permanently (for example, if the document is malformed and missing some
expected fields) or temporarily (if it is a filtered replication and the
filter cannot be fetched). A failed filter fetch will be retried with an
exponential backoff.
couch_replicator_clustering is in charge of monitoring cluster membership
changes. When membership changes, after a configurable quiet period, a rescan
will be initiated. The rescan will shuffle replication jobs to make sure a
replication job is running on only one node.
A new set of stats was added to introspect scheduler and doc processor
internals.
The top replication supervisor's restart strategy is `rest_for_one`. This
means that if a child crashes, all children to the "right" of it will be
restarted (if the supervisor hierarchy is visualized as an upside-down tree).
Clustering, connection pool and rate limiter are towards the "left" as they
are more fundamental; if the clustering child crashes, most other components
will be restarted. The doc processor and multi-db changes children are towards
the "right". If they crash, they can be safely restarted without affecting
already running replications or components like clustering or the connection
pool.
Jira: COUCHDB-3324
The document processor listens for `_replicator` db document updates, parses
those changes, and then tries to add replication jobs to the scheduler.
Listening for changes happens in the `couch_multidb_changes` module. That
module is generic and is set up to listen to shards with the `_replicator`
suffix by `couch_replicator_db_changes`. Updates are then passed to the
document processor's `process_change/2` function.
Document replication ID calculation, which can involve fetching filter code
from the source DB, and addition to the scheduler, are done in a separate
worker process: `couch_replicator_doc_processor_worker`.
Previously, the couch replicator manager did most of this work. There are a
few improvements over the previous implementation:
* Invalid (malformed) replication documents are immediately failed and will
not be continuously retried.
* Replication manager message queue backups are unfortunately a common issue
in production. This is because processing document updates is a serial
(blocking) operation. Most of that blocking code was moved to separate worker
processes.
* Failing filter fetches have an exponential backoff.
* Replication documents don't have to be deleted first and then re-added in
order to update the replication. On update, the document processor will
compare the new and previous replication-related document fields and update
the replication job if those changed. Users can freely update unrelated
(custom) fields in their replication docs.
* In the case of filtered replications using custom functions, the document
processor will periodically check if the filter code on the source has
changed. Filter code contents are factored into the replication ID
calculation, so if the filter code changes the replication ID will change as
well.
Jira: COUCHDB-3324
Over the years, utils accumulated a lot of functionality. Clean up a bit by
separating it into specific modules according to semantics:
- couch_replicator_docs : Handles reading from and writing to replicator dbs.
  This includes updating state fields, parsing options from documents, and
  making sure the replication VDU design document is in sync.
- couch_replicator_filters : Fetches and manipulates replication filters.
- couch_replicator_ids : Calculates replication IDs. Handles versioning and
  pretty-formatting of IDs. Filtered replications using user filter functions
  incorporate a hash of the filter code into the calculation; in that case the
  couch_replicator_filters module is called to fetch the filter from the
  source.
Jira: COUCHDB-3324
AIMD: additive increase / multiplicative decrease feedback control algorithm.
https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease
This is an algorithm which converges on the available channel capacity.
No participant knows the capacity a priori, and participants don't
communicate or know about each other (so they don't coordinate to divide
the capacity among themselves).
A variation of this is used in the TCP congestion control algorithm. It is
proven to converge, while, for example, additive increase / additive decrease
or multiplicative increase / multiplicative decrease are not.
A few tweaks were applied to the base control logic:
* The estimated value is an interval (period) instead of a rate. This is for
convenience, as users will probably want to know how long to sleep. But the
rate is just 1000 / interval, so it is easy to transform.
* There is a hard maximum limit for the estimated period. This is mainly a
practical concern, as connections sleeping too long will time out and / or
jobs will waste time sleeping and consume scheduler slots while others could
be running.
* There is a time decay component used to handle large pauses between updates.
In the case of a large update interval, assume (optimistically) that some
successful requests have been made. Intuitively, the more time passes, the
less accurate the estimated period probably is.
* The rate of updates applied to the algorithm is limited. This effectively
acts as a low-pass filter and makes the algorithm handle spikes and short
bursts of failures better. This is not a novel idea; some alternative TCP
control algorithms like Westwood+ do something similar.
* There is a large downward pressure applied to the increasing interval as it
approaches the maximum limit. This is done by tweaking the additive factor via
a step function. In practice this has the effect of making it a bit harder for
jobs to cross the maximum backoff threshold, as they would be killed and
potentially lose intermediate work.
The main API functions are:
success(Key) -> IntervalInMilliseconds
failure(Key) -> IntervalInMilliseconds
interval(Key) -> IntervalInMilliseconds
Key is any term (hashable by phash2), typically something like {Method, Url}.
The result of each function is the current period value. The caller would then
presumably choose to sleep for that amount of time before or after making
requests. The current interval can be read with the interval(Key) function.
The implementation shards ETS tables based on the key, and a periodic timer
cleans up unused items.
Jira: COUCHDB-3324
This commit adds functionality to share connections between
replications. This is to solve two problems:
- Prior to this commit, each replication would create a pool of
connections and hold onto those connections as long as the replication
existed. This was wasteful and caused CouchDB to use many unnecessary
connections.
- When the pool was being terminated, the pool would block while the
socket was closed. This would cause the entire replication scheduler
to block. By reusing connections, connections are never closed by
clients. They are only ever relinquished. This operation is always
fast.
This commit adds an intermediary process which tracks which connection
processes are being used by which client. It monitors clients and
connections. If a client or connection crashes, the paired
client/connection will be terminated.
A client can gracefully relinquish ownership of a connection. If that
happens, the connection will be shared with another client. If the
connection remains idle for too long, it will be closed.
Jira: COUCHDB-3324
Monitor shards which match a suffix for creation, deletion, and doc updates.
To use it, implement the `couch_multidb_changes` behavior and call `start_link`
with a DbSuffix, passing an option to skip design docs (`skip_ddocs`) if
desired. The behavior's callback functions will be called when shards are
created, deleted, found and updated.
Jira: COUCHDB-3324
This module maintains cluster membership information for replication and
provides functions to check ownership of replication jobs.
A cluster membership change is registered only after a configurable
`cluster_quiet_period` interval has passed since the last node addition or
removal. This is useful in cases of rolling node reboots in a cluster in order
to avoid rescanning for membership changes after every node up and down event,
and instead doing only one rescan at the very end.
Jira: COUCHDB-3324
The scheduling replicator can run a large number of replication jobs by scheduling
them. It will periodically stop some jobs and start new ones. Jobs that fail
will be penalized with an exponential backoff.
Jira: COUCHDB-3324
Revert add sys dbs to lru
This reverts commit 92c25a98666e59ca497fa5732a81e771ff52e07d.
Conflicts:
src/couch/src/couch_lru.erl
This reverts commit 2d984b59ed6886a179fd500efe9780a0d0ad62e3.
New couchup 1.x -> 2.x database migration tool
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This commit adds a new Python-based database migration tool, couchup.
It is intended to be used at the command-line on the server being
upgraded, before bringing the node (or cluster) into service.
couchup provides 4 subcommands to assist in the migration process:
* list - lists all CouchDB 1.x databases
* replicate - replicates one or more 1.x databases to CouchDB 2.x
* rebuild - rebuilds one or more CouchDB 2.x views
* delete - deletes one or more CouchDB 1.x databases
A typical workflow for a single-node upgrade process would look like:
```sh
$ couchup list
$ couchup replicate -a
$ couchup rebuild -a
$ couchup delete -a
```
A clustered upgrade process would be the same, but must be preceded by
setting up all the nodes in the cluster first.
Various optional arguments provide for admin login/password, overriding
ports, quiet mode and so on.
Of special note is that `couchup rebuild` supports an optional flag,
`-f`, to filter deleted documents during the replication process.
I struggled some with the naming convention. For those in the know, a
'1.x database' is a node-local database appearing only on port 5986, and
a '2.x database' is a clustered database appearing on port 5984, and in
raw, sharded form on port 5986.
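As a rough illustration of that naming distinction, the same database name can
be probed on both interfaces. This is a hypothetical sketch; the host, port
defaults and the database name are assumptions, not taken from this commit.
```sh
# Hypothetical check of the same database name on both interfaces; host,
# ports and the database name are illustrative assumptions.
curl http://127.0.0.1:5986/mydb   # node-local "1.x" database (or raw shards)
curl http://127.0.0.1:5984/mydb   # clustered "2.x" database
```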
Couchdb 3376 fix mem3 shards
|
| | | |
| | | |
| | | |
| | | | |
COUCHDB-3376
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This change introduces a new shard_writer process into the mem3_shards
caching approach. It turns out it's not terribly difficult to create
thundering herd scenarios that back up the mem3_shards mailbox. And if
the Q value is large, this backup can happen quite quickly.
This changes things so that we use a temporary process to perform the
actual `ets:insert/2` call, which keeps the shard map out of mem3_shards'
message queue.
A second optimization is that only a single client will attempt to send
the shard map to begin with by checking the existence of the writer key
using `ets:insert_new/2`.
COUCHDB-3376
There's a race condition in mem3_shards that can result in having shards
in the cache for a database that's been deleted. This results in a
confused cluster that thinks a database exists until you attempt to open
it.
The fix is to ignore any cache insert requests that come from an older
version of the dbs db than the mem3_shards cache knows about.
Big thanks to @jdoane for the identification and original patch.
COUCHDB-3376
cloudant/COUCHDB-3174-re-enable-attachment-replication-tests
Re-enable attachment replication tests
These tests now pass.
COUCHDB-3174
Fix _local_docs endpoint
We are using all_docs_reduce_to_count/1 in the
_local_docs handler, but the reductions obtained
from the local btree are different from the
reductions passed from the id btree's enumerator.
This change converts the passed local btree KVs
into a list of the expected #full_doc_info records.
Avoid creation of document if deleting attachment on non-existent doc
- Check the existence of the document before deleting its attachment
- If the document doesn't exist, return 404 instead of creating a new
document
Fixes COUCHDB-3362/FB 85549
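A hypothetical request illustrating the new behavior; the host, port,
credentials, database, document and attachment names are assumptions, not
taken from this commit.
```sh
# Hypothetical request: deleting an attachment on a document that does not
# exist should now return 404 rather than creating the document.
curl -i -X DELETE 'http://admin:pass@127.0.0.1:5984/mydb/no-such-doc/file.txt'
```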