Developer Oriented Replicator Description
=========================================

This description of the scheduling replicator's functionality is mainly geared
to CouchDB developers. It dives a bit into the internals and explains how
everything is connected together. A higher level overview is available in the
[RFC](https://github.com/apache/couchdb-documentation/pull/581). This
documentation assumes the audience is familiar with that description, as well
as with the [Couch Jobs
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/007-background-jobs.md)
and the [Node Types
RFC](https://github.com/apache/couchdb-documentation/blob/master/rfcs/013-node-types.md).

A natural place to start is the top application supervisor:
`couch_replicator_sup`. The set of children in the supervisor is split into
`frontend` and `backend`. The `frontend` set is started on nodes which have the
`api_frontend` node type label set to `true`, and `backend` ones are started on
nodes which have the `replication` label set to `true`. The same node could
have both labels set to `true`, in which case it acts as both a replication
frontend and backend node. However, jobs created by the frontend part of one
node are not guaranteed to run on the backend of that same node.

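As a rough illustration of that split, the supervisor can be thought of as
choosing its children based on the node type labels. The sketch below is only
meant to convey the idea: the `node_type_enabled/1` helper and the placeholder
child specs are hypothetical stand-ins, not the actual `couch_replicator_sup`
code.

```erlang
-module(node_split_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Hypothetical helper standing in for however the node type labels from the
%% Node Types RFC are queried on a real node. Always enabled here.
node_type_enabled(_Label) ->
    true.

init([]) ->
    %% Pick child specs based on which node type labels are enabled, in the
    %% spirit of the frontend/backend split described above. The children are
    %% placeholders, not the real couch_replicator supervisors.
    Frontend = case node_type_enabled(api_frontend) of
        true -> [#{id => frontend_placeholder,
                   start => {gen_event, start_link, []}}];
        false -> []
    end,
    Backend = case node_type_enabled(replication) of
        true -> [#{id => backend_placeholder,
                   start => {gen_event, start_link, []}}];
        false -> []
    end,
    {ok, {#{strategy => one_for_one}, Frontend ++ Backend}}.
```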

Frontend Description
--

The "frontend" consists of the parts which handle HTTP requests, monitor
`_replicator` databases for changes, and create `couch_jobs` replication job
records. Some of the modules involved are:

 * `couch_replicator` : Contains the main API "entry" point into the
   `couch_replicator` application. The `replicate/2` function creates transient
   replication jobs. The `after_db_create/2`, `after_db_delete/2` and
   `after_doc_write/6` functions are called from `couch_epi` callbacks to
   create replication jobs from `_replicator` db events. Eventually they all
   call `couch_replicator_jobs:add_job/3` to create a `couch_jobs` replication
   job. Before the job is created, either the HTTP request body or the
   `_replicator` doc body is parsed into a `Rep` map object. An important
   property of this object is that it can be serialized to JSON and
   deserialized from JSON; a small sketch of that round-trip follows this list.
   The object is saved in the `?REP` field of the replication `couch_jobs` job
   data. Besides creating replication jobs, `couch_replicator` is also
   responsible for handling the `_scheduler/jobs` and `_scheduler/docs`
   monitoring API responses. That happens in the `jobs/0`, `job/1`, `docs/` and
   `doc/2` functions.
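
As a small illustration of that round-trip property, a `Rep`-like map can be
encoded to JSON and decoded back into an equivalent map. The keys below are
made up for the example and are not the real `?REP` schema; `jiffy` is the
JSON library available in a CouchDB release.

```erlang
%% Illustrative only: the keys are invented and are not the real ?REP schema.
Rep = #{
    <<"source">> => <<"http://127.0.0.1:5984/db_a">>,
    <<"target">> => <<"http://127.0.0.1:5984/db_b">>,
    <<"options">> => #{<<"continuous">> => true}
},
%% Encoding to JSON and decoding back yields an equivalent map, which is what
%% allows the job data to be written to and read back from couch_jobs.
Rep = jiffy:decode(jiffy:encode(Rep), [return_maps]).
```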

Backend Description
--

The "backend" consists of the parts which run replication jobs, update their
state, and handle rescheduling on intermittent errors. All the job activity on
these nodes is ultimately driven by `couch_jobs` acceptors which wait in
`couch_jobs:accept/2` for replication jobs.
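
Schematically, an acceptor blocks in `couch_jobs:accept/2`, hands any accepted
job off to be run, and goes back to waiting. The loop below is a simplified
sketch: the empty options map and the exact return shapes of
`couch_jobs:accept/2` are assumptions based on the Couch Jobs RFC, and
`run_job/2` is a placeholder rather than the real `couch_replicator` acceptor
code.

```erlang
-module(acceptor_sketch).
-export([accept_loop/1]).

%% Schematic acceptor loop. The {ok, Job, JobData} / {error, not_found}
%% return shapes are assumptions based on the Couch Jobs RFC.
accept_loop(JobType) ->
    case couch_jobs:accept(JobType, #{}) of
        {ok, Job, JobData} ->
            %% In the real application a couch_replicator_job process takes
            %% over the accepted job at this point.
            ok = run_job(Job, JobData),
            accept_loop(JobType);
        {error, not_found} ->
            %% Nothing to accept right now; go back to waiting.
            accept_loop(JobType)
    end.

%% Placeholder for whatever actually runs the replication job.
run_job(_Job, _JobData) ->
    ok.
```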

 * `couch_replicator_job_server` : A singleton process in charge of spawning
   and keeping track of `couch_replicator_job` processes. It ensures there is a
   limited number of replication jobs running on each node. It periodically
   accepts new jobs and stops the oldest running ones in order to give other
   pending jobs a chance to run. It runs this logic in the `reschedule/1`
   function, which is called with a frequency defined by the `interval_sec`
   configuration setting. The other parameters which determine how jobs start
   and stop are `max_jobs` and `max_churn`. The node will try to limit itself
   to running `max_jobs` jobs on average, with periodic spikes of up to
   `max_jobs + max_churn` jobs at a time, and it will try not to start more
   than `max_churn` jobs during each rescheduling cycle. A simplified sketch of
   this start/stop arithmetic appears at the end of this document.

 * `couch_replicator_connection`: Maintains a global replication connection
   pool. It allows reusing connections across replication tasks. The main
   interface is `acquire/1` and `release/1`. The general idea is that once a
   connection is established, it is kept around for
   `replicator.connection_close_interval` milliseconds in case another
   replication task wants to re-use it. It is worth pointing out how linking
   and monitoring are handled: workers are linked to the connection pool when
   they are created, so if they crash the connection pool receives an `'EXIT'`
   message and cleans up after the worker. The connection pool also monitors
   owners (by monitoring the `Pid` from the `From` argument in the call to
   `acquire/1`) and cleans up if the owner dies and the pool receives a
   `'DOWN'` message. Another interesting detail is that connection
   establishment (creation) happens in the owner process, so the pool is not
   blocked on it. A self-contained sketch of this link/monitor cleanup pattern
   appears at the end of this document.

 * `couch_replicator_rate_limiter`: Implements a rate limiter to handle
   connection throttling by sources or targets which respond with 429 error
   codes. It uses the Additive Increase / Multiplicative Decrease (AIMD)
   feedback control algorithm to converge on the channel capacity, and is
   implemented using a 16-way sharded ETS table to maintain connection state.
   The table sharding code is split out into the
   `couch_replicator_rate_limiter_tables` module. The purpose of the rate
   limiter is to maintain and continually estimate sleep intervals for each
   connection, represented as a `{Method, Url}` pair. The interval is updated
   on each call to `success/1` or `failure/1`: for a successful request, a
   client should call `success/1`, and whenever a 429 response is received, the
   client should call `failure/1`. When no failures are happening, the code
   ensures the ETS tables are empty in order to have a lower impact on a
   running system. A generic sketch of the AIMD update rule appears at the end
   of this document.
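
To make the `max_jobs` / `max_churn` interaction concrete, here is a
simplified sketch of the start/stop arithmetic for a single rescheduling
cycle. It is not the real `reschedule/1` implementation; the module, function
and argument names are invented for the example.

```erlang
-module(churn_sketch).
-export([plan_cycle/3]).

%% Illustrative arithmetic for one rescheduling cycle, derived from the
%% description above; not the real couch_replicator_job_server logic.
%%
%% Running - number of replication jobs currently running on this node
%% Pending - number of jobs waiting for a chance to run
plan_cycle(Running, Pending, #{max_jobs := MaxJobs, max_churn := MaxChurn}) ->
    %% Never start more than MaxChurn jobs in one cycle.
    StartCount = min(Pending, MaxChurn),
    %% Stop just enough of the oldest running jobs to get back under MaxJobs
    %% once the new ones are started, but no more than MaxChurn of them.
    %% Until the stopped jobs wind down the node may briefly run up to
    %% MaxJobs + MaxChurn jobs.
    StopCount = min(MaxChurn, max(0, Running + StartCount - MaxJobs)),
    #{start => StartCount, stop => StopCount}.
```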
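
The link/monitor cleanup pattern used by the connection pool can be shown with
a small self-contained `gen_server`. Everything below is invented for
illustration: it is not the real `couch_replicator_connection` module, it uses
dummy processes instead of sockets, and, unlike the real pool, it creates the
"connection" inside the pool process instead of in the owner process.

```erlang
-module(conn_pool_sketch).
-behaviour(gen_server).
-export([start_link/0, acquire/1, release/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

acquire(Url) ->
    gen_server:call(?MODULE, {acquire, Url, self()}).

release(Worker) ->
    gen_server:call(?MODULE, {release, Worker}).

init([]) ->
    %% Trap exits so a crashing linked worker arrives as an 'EXIT' message
    %% instead of taking the pool down with it.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call({acquire, _Url, Owner}, _From, Workers) ->
    %% Dummy "connection" process. Link to it to learn about crashes, and
    %% monitor the owner so its connections can be cleaned up if it dies.
    Worker = spawn_link(fun() -> receive stop -> ok end end),
    MRef = erlang:monitor(process, Owner),
    {reply, {ok, Worker}, Workers#{Worker => {Owner, MRef}}};
handle_call({release, Worker}, _From, Workers) ->
    case maps:take(Worker, Workers) of
        {{_Owner, MRef}, Rest} ->
            erlang:demonitor(MRef, [flush]),
            unlink(Worker),
            Worker ! stop,
            {reply, ok, Rest};
        error ->
            {reply, ok, Workers}
    end.

handle_cast(_Msg, Workers) ->
    {noreply, Workers}.

handle_info({'EXIT', Worker, _Reason}, Workers) ->
    %% A linked worker crashed; drop it and its owner monitor.
    case maps:take(Worker, Workers) of
        {{_Owner, MRef}, Rest} ->
            erlang:demonitor(MRef, [flush]),
            {noreply, Rest};
        error ->
            {noreply, Workers}
    end;
handle_info({'DOWN', _MRef, process, Owner, _Reason}, Workers) ->
    %% An owner died; stop and forget any workers it had acquired.
    Owned = [W || {W, {O, _}} <- maps:to_list(Workers), O =:= Owner],
    lists:foreach(fun(W) -> unlink(W), W ! stop end, Owned),
    {noreply, maps:without(Owned, Workers)}.
```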
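
Finally, the Additive Increase / Multiplicative Decrease idea behind the rate
limiter can be sketched as a pair of pure functions updating a sleep interval.
The constants and function shapes below are made up for the example; the real
parameters, ETS sharding and timing logic live in
`couch_replicator_rate_limiter` and `couch_replicator_rate_limiter_tables`.

```erlang
-module(aimd_sketch).
-export([success/1, failure/1]).

%% Generic sketch of an AIMD-style update of a per-{Method, Url} sleep
%% interval (in milliseconds). The constants are invented for the example.
-define(MIN_INTERVAL, 0).
-define(MAX_INTERVAL, 25000).
-define(SUCCESS_STEP, 25).   % shrink the sleep a little after each success
-define(BACKOFF_FACTOR, 2).  % grow the sleep sharply after each 429

%% Successful request: requests may go a bit faster, so decrease the sleep
%% interval additively.
success(Interval) ->
    max(?MIN_INTERVAL, Interval - ?SUCCESS_STEP).

%% 429 response: back off by growing the sleep interval multiplicatively
%% (the inner max/2 makes sure a zero interval actually starts backing off).
failure(Interval) ->
    min(?MAX_INTERVAL, max(?SUCCESS_STEP, Interval * ?BACKOFF_FACTOR)).
```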