author     Robert Speicher <rspeicher@gmail.com>  2021-01-20 13:34:23 -0600
committer  Robert Speicher <rspeicher@gmail.com>  2021-01-20 13:34:23 -0600
commit     6438df3a1e0fb944485cebf07976160184697d72
tree       00b09bfd170e77ae9391b1a2f5a93ef6839f2597  /doc/administration/postgresql
parent     42bcd54d971da7ef2854b896a7b34f4ef8601067
Add latest changes from gitlab-org/gitlab@13-8-stable-ee (tag: v13.8.0-rc42)
Diffstat (limited to 'doc/administration/postgresql')
-rw-r--r--  doc/administration/postgresql/pgbouncer.md                    11
-rw-r--r--  doc/administration/postgresql/replication_and_failover.md   1186
2 files changed, 596 insertions, 601 deletions
diff --git a/doc/administration/postgresql/pgbouncer.md b/doc/administration/postgresql/pgbouncer.md index f09ac3052f4..951edfeaec2 100644 --- a/doc/administration/postgresql/pgbouncer.md +++ b/doc/administration/postgresql/pgbouncer.md @@ -66,6 +66,12 @@ This content has been moved to a [new location](replication_and_failover.md#conf 1. At this point, your instance should connect to the database through PgBouncer. If you are having issues, see the [Troubleshooting](#troubleshooting) section +## Backups + +Do not backup or restore GitLab through a PgBouncer connection: this will cause a GitLab outage. + +[Read more about this and how to reconfigure backups](../../raketasks/backup_restore.md#backup-and-restore-for-installations-using-pgbouncer). + ## Enable Monitoring > [Introduced](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3786) in GitLab 12.0. @@ -150,7 +156,10 @@ ote_pid | tls ## Procedure for bypassing PgBouncer -Some database changes have to be done directly, and not through PgBouncer. This includes database restores and GitLab upgrades (because of the database migrations). +Some database changes have to be done directly, and not through PgBouncer. + +Read more about the affected tasks: [database restores](../../raketasks/backup_restore.md#backup-and-restore-for-installations-using-pgbouncer) +and [GitLab upgrades](https://docs.gitlab.com/omnibus/update/README.html#use-postgresql-ha). 1. To find the primary node, run the following on a database node: diff --git a/doc/administration/postgresql/replication_and_failover.md b/doc/administration/postgresql/replication_and_failover.md index 303246c9c82..75d0c558962 100644 --- a/doc/administration/postgresql/replication_and_failover.md +++ b/doc/administration/postgresql/replication_and_failover.md @@ -46,22 +46,19 @@ Each database node runs three services: `PostgreSQL` - The database itself. -`repmgrd` - Communicates with other repmgrd services in the cluster and handles -failover when issues with the master server occurs. The failover procedure +`Patroni` - Communicates with other patroni services in the cluster and handles +failover when issues with the leader server occurs. The failover procedure consists of: -- Selecting a new master for the cluster. -- Promoting the new node to master. -- Instructing remaining servers to follow the new master node. -- The old master node is automatically evicted from the cluster and should be - rejoined manually once recovered. +- Selecting a new leader for the cluster. +- Promoting the new node to leader. +- Instructing remaining servers to follow the new leader node. -`Consul` agent - Monitors the status of each node in the database cluster and -tracks its health in a service definition on the Consul cluster. +`Consul` agent - To communicate with Consul cluster which stores the current Patroni state. The agent monitors the status of each node in the database cluster and tracks its health in a service definition on the Consul cluster. ### Consul server node -The Consul server node runs the Consul server service. +The Consul server node runs the Consul server service. These nodes must have reached the quorum and elected a leader _before_ Patroni cluster bootstrap otherwise database nodes will wait until such Consul leader is elected. 
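For example, before reconfiguring the database nodes you can confirm that the Consul servers have formed a quorum and elected a leader. This is a minimal sketch that assumes the default Omnibus binary path; run it on any Consul server node:

```shell
# Every Consul server node should report as "alive".
/opt/gitlab/embedded/bin/consul members

# Exactly one server should be listed as the Raft "leader".
/opt/gitlab/embedded/bin/consul operator raft list-peers
```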
### PgBouncer node @@ -80,7 +77,7 @@ Each service in the package comes with a set of [default ports](https://docs.git - Application servers connect to either PgBouncer directly via its [default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#pgbouncer) or via a configured Internal Load Balancer (TCP) that serves multiple PgBouncers. - PgBouncer connects to the primary database servers [PostgreSQL default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#postgresql) -- Repmgr connects to the database servers [PostgreSQL default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#postgresql) +- Patroni actively manages the running PostgreSQL processes and configuration. - PostgreSQL secondaries connect to the primary database servers [PostgreSQL default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#postgresql) - Consul servers and agents connect to each others [Consul default ports](https://docs.gitlab.com/omnibus/package-information/defaults.html#consul) @@ -141,7 +138,7 @@ available database connections. In this document we are assuming 3 database nodes, which makes this configuration: ```ruby -postgresql['max_wal_senders'] = 4 +patroni['postgresql']['max_wal_senders'] = 4 ``` As previously mentioned, you'll have to prepare the network subnets that will @@ -186,18 +183,6 @@ Few notes on the service itself: - `/etc/gitlab/gitlab.rb`: hashed, and in plain text - `/var/opt/gitlab/pgbouncer/pg_auth`: hashed -#### Repmgr information - -When using default setup, you will only have to prepare the network subnets that will -be allowed to authenticate with the service. - -Few notes on the service itself: - -- The service runs under the same system account as the database - - In the package, this is by default `gitlab-psql` -- The service will have a superuser database user account generated for it - - This defaults to `gitlab_repmgr` - ### Installing Omnibus GitLab First, make sure to [download/install](https://about.gitlab.com/install/) @@ -212,72 +197,80 @@ When installing the GitLab package, do not supply `EXTERNAL_URL` value. 1. Make sure to [configure the Consul nodes](../consul.md). 1. Make sure you collect [`CONSUL_SERVER_NODES`](#consul-information), [`PGBOUNCER_PASSWORD_HASH`](#pgbouncer-information), [`POSTGRESQL_PASSWORD_HASH`](#postgresql-information), the [number of db nodes](#postgresql-information), and the [network address](#network-information) before executing the next step. -1. On the master database node, edit `/etc/gitlab/gitlab.rb` replacing values noted in the `# START user configuration` section: +#### Configuring Patroni cluster - ```ruby - # Disable all components except PostgreSQL and Repmgr and Consul - roles ['postgres_role'] +You must enable Patroni explicitly to be able to use it (with `patroni['enable'] = true`). When Patroni is enabled +repmgr will be disabled automatically. - # PostgreSQL configuration - postgresql['listen_address'] = '0.0.0.0' - postgresql['hot_standby'] = 'on' - postgresql['wal_level'] = 'replica' - postgresql['shared_preload_libraries'] = 'repmgr_funcs' +Any PostgreSQL configuration item that controls replication, for example `wal_level`, `max_wal_senders`, etc, are strictly +controlled by Patroni and will override the original settings that you make with the `postgresql[...]` configuration key. +Hence, they are all separated and placed under `patroni['postgresql'][...]`. This behavior is limited to replication. 
+Patroni honours any other PostgreSQL configuration that was made with the `postgresql[...]` configuration key. For example, +`max_wal_senders` by default is set to `5`. If you wish to change this you must set it with the `patroni['postgresql']['max_wal_senders']` +configuration key. - # Disable automatic database migrations - gitlab_rails['auto_migrate'] = false +NOTE: +The configuration of a Patroni node is very similar to a repmgr but shorter. When Patroni is enabled, first you can ignore +any replication setting of PostgreSQL (it will be overwritten anyway). Then you can remove any `repmgr[...]` or +repmgr-specific configuration as well. Especially, make sure that you remove `postgresql['shared_preload_libraries'] = 'repmgr_funcs'`. - # Configure the Consul agent - consul['services'] = %w(postgresql) +Here is an example similar to [the one that was done with repmgr](#configuring-repmgr-nodes): - # START user configuration - # Please set the real values as explained in Required Information section - # - # Replace PGBOUNCER_PASSWORD_HASH with a generated md5 value - postgresql['pgbouncer_user_password'] = 'PGBOUNCER_PASSWORD_HASH' - # Replace POSTGRESQL_PASSWORD_HASH with a generated md5 value - postgresql['sql_user_password'] = 'POSTGRESQL_PASSWORD_HASH' - # Replace X with value of number of db nodes + 1 - postgresql['max_wal_senders'] = X - postgresql['max_replication_slots'] = X +```ruby +# Disable all components except PostgreSQL, Patroni (or Repmgr), and Consul +roles['postgres_role'] - # Replace XXX.XXX.XXX.XXX/YY with Network Address - postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY) - repmgr['trust_auth_cidr_addresses'] = %w(127.0.0.1/32 XXX.XXX.XXX.XXX/YY) +# Enable Patroni (which automatically disables Repmgr). +patroni['enable'] = true - # Replace placeholders: - # - # Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z - # with the addresses gathered for CONSUL_SERVER_NODES - consul['configuration'] = { - retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z) - } - # - # END user configuration - ``` +# PostgreSQL configuration +postgresql['listen_address'] = '0.0.0.0' - > `postgres_role` was introduced with GitLab 10.3 +# Disable automatic database migrations +gitlab_rails['auto_migrate'] = false -1. On secondary nodes, add all the configuration specified above for primary node - to `/etc/gitlab/gitlab.rb`. In addition, append the following configuration - to inform `gitlab-ctl` that they are standby nodes initially and it need not - attempt to register them as primary node +# Configure the Consul agent +consul['services'] = %w(postgresql) - ```ruby - # Specify if a node should attempt to be master on initialization - repmgr['master_on_initialization'] = false - ``` +# START user configuration +# Please set the real values as explained in Required Information section +# +# Replace PGBOUNCER_PASSWORD_HASH with a generated md5 value +postgresql['pgbouncer_user_password'] = 'PGBOUNCER_PASSWORD_HASH' +# Replace POSTGRESQL_PASSWORD_HASH with a generated md5 value +postgresql['sql_user_password'] = 'POSTGRESQL_PASSWORD_HASH' -1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. -1. 
[Enable Monitoring](#enable-monitoring) +# Replace X with value of number of db nodes + 1 (OPTIONAL the default value is 5) +patroni['postgresql']['max_wal_senders'] = X +patroni['postgresql']['max_replication_slots'] = X -> Please note: -> -> - If you want your database to listen on a specific interface, change the configuration: -> `postgresql['listen_address'] = '0.0.0.0'`. -> - If your PgBouncer service runs under a different user account, -> you also need to specify: `postgresql['pgbouncer_user'] = PGBOUNCER_USERNAME` in -> your configuration. +# Replace XXX.XXX.XXX.XXX/YY with Network Address +postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY) + +# Replace placeholders: +# +# Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z +# with the addresses gathered for CONSUL_SERVER_NODES +consul['configuration'] = { + retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z) +} +# +# END user configuration +``` + +You do not need an additional or different configuration for replica nodes. As a matter of fact, you don't have to have +a predetermined primary node. Therefore all database nodes use the same configuration. + +Once the configuration of a node is done, you must [reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) +on each node for the changes to take effect. + +Generally, when Consul cluster is ready, the first node that [reconfigures](../restart_gitlab.md#omnibus-gitlab-reconfigure) +becomes the leader. You do not need to sequence the nodes reconfiguration. You can run them in parallel or in any order. +If you choose an arbitrary order you do not have any predetermined master. + +NOTE: +As opposed to repmgr, once the nodes are reconfigured you do not need any further action or additional command to join +the replicas. #### Enable Monitoring @@ -298,129 +291,6 @@ If you enable Monitoring, it must be enabled on **all** database servers. 1. Run `sudo gitlab-ctl reconfigure` to compile the configuration. -#### Database nodes post-configuration - -##### Primary node - -Select one node as a primary node. - -1. Open a database prompt: - - ```shell - gitlab-psql -d gitlabhq_production - ``` - -1. Enable the `pg_trgm` extension: - - ```shell - CREATE EXTENSION pg_trgm; - ``` - -1. Enable the `btree_gist` extension: - - ```shell - CREATE EXTENSION btree_gist; - ``` - -1. Exit the database prompt by typing `\q` and Enter. - -1. Verify the cluster is initialized with one node: - - ```shell - gitlab-ctl repmgr cluster show - ``` - - The output should be similar to the following: - - ```plaintext - Role | Name | Upstream | Connection String - ----------+----------|----------|---------------------------------------- - * master | HOSTNAME | | host=HOSTNAME user=gitlab_repmgr dbname=gitlab_repmgr - ``` - -1. Note down the hostname or IP address in the connection string: `host=HOSTNAME`. We will - refer to the hostname in the next section as `MASTER_NODE_NAME`. If the value - is not an IP address, it will need to be a resolvable name (via DNS or - `/etc/hosts`) - -##### Secondary nodes - -1. Set up the repmgr standby: - - ```shell - gitlab-ctl repmgr standby setup MASTER_NODE_NAME - ``` - - Do note that this will remove the existing data on the node. The command - has a wait time. 
- - The output should be similar to the following: - - ```console - # gitlab-ctl repmgr standby setup MASTER_NODE_NAME - Doing this will delete the entire contents of /var/opt/gitlab/postgresql/data - If this is not what you want, hit Ctrl-C now to exit - To skip waiting, rerun with the -w option - Sleeping for 30 seconds - Stopping the database - Removing the data - Cloning the data - Starting the database - Registering the node with the cluster - ok: run: repmgrd: (pid 19068) 0s - ``` - -1. Verify the node now appears in the cluster: - - ```shell - gitlab-ctl repmgr cluster show - ``` - - The output should be similar to the following: - - ```plaintext - Role | Name | Upstream | Connection String - ----------+---------|-----------|------------------------------------------------ - * master | MASTER | | host=MASTER_NODE_NAME user=gitlab_repmgr dbname=gitlab_repmgr - standby | STANDBY | MASTER | host=STANDBY_HOSTNAME user=gitlab_repmgr dbname=gitlab_repmgr - ``` - -Repeat the above steps on all secondary nodes. - -#### Database checkpoint - -Before moving on, make sure the databases are configured correctly. Run the -following command on the **primary** node to verify that replication is working -properly: - -```shell -gitlab-ctl repmgr cluster show -``` - -The output should be similar to: - -```plaintext -Role | Name | Upstream | Connection String -----------+--------------|--------------|-------------------------------------------------------------------- -* master | MASTER | | host=MASTER port=5432 user=gitlab_repmgr dbname=gitlab_repmgr - standby | STANDBY | MASTER | host=STANDBY port=5432 user=gitlab_repmgr dbname=gitlab_repmgr -``` - -If the 'Role' column for any node says "FAILED", check the -[Troubleshooting section](#troubleshooting) before proceeding. - -Also, check that the check master command works successfully on each node: - -```shell -su - gitlab-consul -gitlab-ctl repmgr-check-master || echo 'This node is a standby repmgr node' -``` - -This command relies on exit codes to tell Consul whether a particular node is a master -or secondary. The most important thing here is that this command does not produce errors. -If there are errors it's most likely due to incorrect `gitlab-consul` database user permissions. -Check the [Troubleshooting section](#troubleshooting) before proceeding. - ### Configuring the PgBouncer node 1. Make sure you collect [`CONSUL_SERVER_NODES`](#consul-information), [`CONSUL_PASSWORD_HASH`](#consul-information), and [`PGBOUNCER_PASSWORD_HASH`](#pgbouncer-information) before executing the next step. @@ -577,6 +447,12 @@ PostgreSQL's `trust_auth_cidr_addresses` in `gitlab.rb` on your database nodes. [PgBouncer error `ERROR: pgbouncer cannot connect to server`](#pgbouncer-error-error-pgbouncer-cannot-connect-to-server) in the Troubleshooting section before proceeding. +### Backups + +Do not backup or restore GitLab through a PgBouncer connection: this will cause a GitLab outage. + +[Read more about this and how to reconfigure backups](../../raketasks/backup_restore.md#backup-and-restore-for-installations-using-pgbouncer). + ### Ensure GitLab is running At this point, your GitLab instance should be up and running. 
Verify you're able @@ -605,9 +481,9 @@ Here is a list and description of each machine and the assigned IP: - `10.6.0.21`: PgBouncer 1 - `10.6.0.22`: PgBouncer 2 - `10.6.0.23`: PgBouncer 3 -- `10.6.0.31`: PostgreSQL master -- `10.6.0.32`: PostgreSQL secondary -- `10.6.0.33`: PostgreSQL secondary +- `10.6.0.31`: PostgreSQL 1 +- `10.6.0.32`: PostgreSQL 2 +- `10.6.0.33`: PostgreSQL 3 - `10.6.0.41`: GitLab application All passwords are set to `toomanysecrets`, please do not use this password or derived hashes and the `external_url` for GitLab is `http://gitlab.example.com`. @@ -667,29 +543,28 @@ An internal load balancer (TCP) is then required to be setup to serve each PgBou #### Example recommended setup for PostgreSQL servers -##### Primary node - -On primary node edit `/etc/gitlab/gitlab.rb`: +On database nodes edit `/etc/gitlab/gitlab.rb`: ```ruby -# Disable all components except PostgreSQL and Repmgr and Consul +# Disable all components except PostgreSQL, Patroni (or Repmgr), and Consul roles ['postgres_role'] # PostgreSQL configuration postgresql['listen_address'] = '0.0.0.0' postgresql['hot_standby'] = 'on' postgresql['wal_level'] = 'replica' -postgresql['shared_preload_libraries'] = 'repmgr_funcs' + +# Enable Patroni (which automatically disables Repmgr). +patroni['enable'] = true # Disable automatic database migrations gitlab_rails['auto_migrate'] = false postgresql['pgbouncer_user_password'] = '771a8625958a529132abe6f1a4acb19c' postgresql['sql_user_password'] = '450409b85a0223a214b5fb1484f34d0f' -postgresql['max_wal_senders'] = 4 +patroni['postgresql']['max_wal_senders'] = 4 postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16) -repmgr['trust_auth_cidr_addresses'] = %w(10.6.0.0/16) # Configure the Consul agent consul['services'] = %w(postgresql) @@ -702,85 +577,27 @@ consul['monitoring_service_discovery'] = true [Reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. -##### Secondary nodes - -On secondary nodes, edit `/etc/gitlab/gitlab.rb` and add all the configuration -added to primary node, noted above. In addition, append the following -configuration: - -```ruby -# Specify if a node should attempt to be master on initialization -repmgr['master_on_initialization'] = false -``` - -[Reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. - -###### Example recommended setup for application server - -On the server edit `/etc/gitlab/gitlab.rb`: - -```ruby -external_url 'http://gitlab.example.com' - -gitlab_rails['db_host'] = '10.6.0.20' # Internal Load Balancer for PgBouncer nodes -gitlab_rails['db_port'] = 6432 -gitlab_rails['db_password'] = 'toomanysecrets' -gitlab_rails['auto_migrate'] = false - -postgresql['enable'] = false -pgbouncer['enable'] = false -consul['enable'] = true - -# Configure Consul agent -consul['watchers'] = %w(postgresql) - -pgbouncer['users'] = { - 'gitlab-consul': { - password: '5e0e3263571e3704ad655076301d6ebe' - }, - 'pgbouncer': { - password: '771a8625958a529132abe6f1a4acb19c' - } -} - -consul['configuration'] = { - retry_join: %w(10.6.0.11 10.6.0.12 10.6.0.13) -} -``` - -[Reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. - #### Example recommended setup manual steps After deploying the configuration follow these steps: -1. On `10.6.0.31`, our primary database: - - Enable the `pg_trgm` and `btree_gist` extensions: - - ```shell - gitlab-psql -d gitlabhq_production - ``` +1. 
Find the primary database node: ```shell - CREATE EXTENSION pg_trgm; - CREATE EXTENSION btree_gist; + gitlab-ctl get-postgresql-primary ``` -1. On `10.6.0.32`, our first standby database: +1. On the primary database node: - Make this node a standby of the primary: + Enable the `pg_trgm` and `btree_gist` extensions: ```shell - gitlab-ctl repmgr standby setup 10.6.0.21 + gitlab-psql -d gitlabhq_production ``` -1. On `10.6.0.33`, our second standby database: - - Make this node a standby of the primary: - ```shell - gitlab-ctl repmgr standby setup 10.6.0.21 + CREATE EXTENSION pg_trgm; + CREATE EXTENSION btree_gist; ``` 1. On `10.6.0.41`, our application server: @@ -802,15 +619,15 @@ After deploying the configuration follow these steps: This example uses 3 PostgreSQL servers, and 1 application node (with PgBouncer setup alongside). It differs from the [recommended setup](#example-recommended-setup) by moving the Consul servers into the same servers we use for PostgreSQL. -The trade-off is between reducing server counts, against the increased operational complexity of needing to deal with PostgreSQL [failover](#failover-procedure) and [restore](#restore-procedure) procedures in addition to [Consul outage recovery](../consul.md#outage-recovery) on the same set of machines. +The trade-off is between reducing server counts, against the increased operational complexity of needing to deal with PostgreSQL [failover](#manual-failover-procedure-for-patroni) procedures in addition to [Consul outage recovery](../consul.md#outage-recovery) on the same set of machines. In this example we start with all servers on the same 10.6.0.0/16 private network range, they can connect to each freely other on those addresses. Here is a list and description of each machine and the assigned IP: -- `10.6.0.21`: PostgreSQL master -- `10.6.0.22`: PostgreSQL secondary -- `10.6.0.23`: PostgreSQL secondary +- `10.6.0.21`: PostgreSQL 1 +- `10.6.0.22`: PostgreSQL 2 +- `10.6.0.23`: PostgreSQL 3 - `10.6.0.31`: GitLab application All passwords are set to `toomanysecrets`, please do not use this password or derived hashes. @@ -821,9 +638,7 @@ Please note that after the initial configuration, if a failover occurs, the Post #### Example minimal configuration for database servers -##### Primary node - -On primary database node edit `/etc/gitlab/gitlab.rb`: +On database nodes edit `/etc/gitlab/gitlab.rb`: ```ruby # Disable all components except PostgreSQL, Repmgr, and Consul @@ -833,7 +648,9 @@ roles ['postgres_role'] postgresql['listen_address'] = '0.0.0.0' postgresql['hot_standby'] = 'on' postgresql['wal_level'] = 'replica' -postgresql['shared_preload_libraries'] = 'repmgr_funcs' + +# Enable Patroni (which automatically disables Repmgr). +patroni['enable'] = true # Disable automatic database migrations gitlab_rails['auto_migrate'] = false @@ -843,10 +660,9 @@ consul['services'] = %w(postgresql) postgresql['pgbouncer_user_password'] = '771a8625958a529132abe6f1a4acb19c' postgresql['sql_user_password'] = '450409b85a0223a214b5fb1484f34d0f' -postgresql['max_wal_senders'] = 4 +patroni['postgresql']['max_wal_senders'] = 4 postgresql['trust_auth_cidr_addresses'] = %w(10.6.0.0/16) -repmgr['trust_auth_cidr_addresses'] = %w(10.6.0.0/16) consul['configuration'] = { server: true, @@ -856,16 +672,6 @@ consul['configuration'] = { [Reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. 
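Once every database node has been reconfigured, you can check which node Patroni elected as leader. The sketch below assumes the Patroni REST API is listening on its default port on the local node:

```shell
# Ask the Consul-backed service discovery for the current PostgreSQL leader.
gitlab-ctl get-postgresql-primary

# Query the local Patroni REST API for this node's own role ("master" or "replica").
curl http://localhost:8008
```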
-##### Secondary nodes - -On secondary nodes, edit `/etc/gitlab/gitlab.rb` and add all the information added -to primary node, noted above. In addition, append the following configuration - -```ruby -# Specify if a node should attempt to be master on initialization -repmgr['master_on_initialization'] = false -``` - #### Example minimal configuration for application server On the server edit `/etc/gitlab/gitlab.rb`: @@ -908,249 +714,10 @@ consul['configuration'] = { The manual steps for this configuration are the same as for the [example recommended setup](#example-recommended-setup-manual-steps). -### Failover procedure - -By default, if the master database fails, `repmgrd` should promote one of the -standby nodes to master automatically, and Consul will update PgBouncer with -the new master. - -If you need to failover manually, you have two options: - -**Shutdown the current master database** - -Run: - -```shell -gitlab-ctl stop postgresql -``` - -The automated failover process will see this and failover to one of the -standby nodes. - -**Or perform a manual failover** - -1. Ensure the old master node is not still active. -1. Login to the server that should become the new master and run: - - ```shell - gitlab-ctl repmgr standby promote - ``` - -1. If there are any other standby servers in the cluster, have them follow - the new master server: - - ```shell - gitlab-ctl repmgr standby follow NEW_MASTER - ``` - -### Restore procedure - -If a node fails, it can be removed from the cluster, or added back as a standby -after it has been restored to service. - -#### Remove a standby from the cluster - - From any other node in the cluster, run: - - ```shell - gitlab-ctl repmgr standby unregister --node=X - ``` - - where X is the value of node in `repmgr.conf` on the old server. - - To find this, you can use: - - ```shell - awk -F = '$1 == "node" { print $2 }' /var/opt/gitlab/postgresql/repmgr.conf - ``` - - It will output something like: - - ```plaintext - 959789412 - ``` - - Then you will use this ID to unregister the node: - - ```shell - gitlab-ctl repmgr standby unregister --node=959789412 - ``` - -#### Add a node as a standby server - - From the standby node, run: - - ```shell - gitlab-ctl repmgr standby follow NEW_MASTER - gitlab-ctl restart repmgrd - ``` - - WARNING: - When the server is brought back online, and before - you switch it to a standby node, repmgr will report that there are two masters. - If there are any clients that are still attempting to write to the old master, - this will cause a split, and the old master will need to be resynced from - scratch by performing a `gitlab-ctl repmgr standby setup NEW_MASTER`. - -#### Add a failed master back into the cluster as a standby node - - Once `repmgrd` and PostgreSQL are running, the node will need to follow the new - as a standby node. - - ```shell - gitlab-ctl repmgr standby follow NEW_MASTER - ``` - - Once the node is following the new master as a standby, the node needs to be - [unregistered from the cluster on the new master node](#remove-a-standby-from-the-cluster). - - Once the old master node has been unregistered from the cluster, it will need - to be setup as a new standby: - - ```shell - gitlab-ctl repmgr standby setup NEW_MASTER - ``` - - Failure to unregister and read the old master node can lead to subsequent failovers - not working. 
- -### Alternate configurations - -#### Database authorization - -By default, we give any host on the database network the permission to perform -repmgr operations using PostgreSQL's `trust` method. If you do not want this -level of trust, there are alternatives. - -You can trust only the specific nodes that will be database clusters, or you -can require md5 authentication. - -#### Trust specific addresses - -If you know the IP address, or FQDN of all database and PgBouncer nodes in the -cluster, you can trust only those nodes. - -In `/etc/gitlab/gitlab.rb` on all of the database nodes, set -`repmgr['trust_auth_cidr_addresses']` to an array of strings containing all of -the addresses. - -If setting to a node's FQDN, they must have a corresponding PTR record in DNS. -If setting to a node's IP address, specify it as `XXX.XXX.XXX.XXX/32`. - -For example: - -```ruby -repmgr['trust_auth_cidr_addresses'] = %w(192.168.1.44/32 db2.example.com) -``` - -#### MD5 Authentication - -If you are running on an untrusted network, repmgr can use md5 authentication -with a [`.pgpass` file](https://www.postgresql.org/docs/11/libpq-pgpass.html) -to authenticate. - -You can specify by IP address, FQDN, or by subnet, using the same format as in -the previous section: - -1. On the current master node, create a password for the `gitlab` and - `gitlab_repmgr` user: - - ```shell - gitlab-psql -d template1 - template1=# \password gitlab_repmgr - Enter password: **** - Confirm password: **** - template1=# \password gitlab - ``` - -1. On each database node: - - 1. Edit `/etc/gitlab/gitlab.rb`: - 1. Ensure `repmgr['trust_auth_cidr_addresses']` is **not** set - 1. Set `postgresql['md5_auth_cidr_addresses']` to the desired value - 1. Set `postgresql['sql_replication_user'] = 'gitlab_repmgr'` - 1. Reconfigure with `gitlab-ctl reconfigure` - 1. Restart PostgreSQL with `gitlab-ctl restart postgresql` - - 1. Create a `.pgpass` file. Enter the `gitlab_repmgr` password twice to - when asked: - - ```shell - gitlab-ctl write-pgpass --user gitlab_repmgr --hostuser gitlab-psql --database '*' - ``` - -1. On each PgBouncer node, edit `/etc/gitlab/gitlab.rb`: - 1. Ensure `gitlab_rails['db_password']` is set to the plaintext password for - the `gitlab` database user - 1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect - -## Troubleshooting - -### Consul and PostgreSQL changes not taking effect - -Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL, it will not restart the services. However, not all changes can be activated by reloading. - -To restart either service, run `gitlab-ctl restart SERVICE` - -For PostgreSQL, it is usually safe to restart the master node by default. Automatic failover defaults to a 1 minute timeout. Provided the database returns before then, nothing else needs to be done. To be safe, you can stop `repmgrd` on the standby nodes first with `gitlab-ctl stop repmgrd`, then start afterwards with `gitlab-ctl start repmgrd`. - -On the Consul server nodes, it is important to [restart the Consul service](../consul.md#restart-consul) in a controlled manner. - -### `gitlab-ctl repmgr-check-master` command produces errors - -If this command displays errors about database permissions it is likely that something failed during -install, resulting in the `gitlab-consul` database user getting incorrect permissions. Follow these -steps to fix the problem: - -1. 
On the master database node, connect to the database prompt - `gitlab-psql -d template1` -1. Delete the `gitlab-consul` user - `DROP USER "gitlab-consul";` -1. Exit the database prompt - `\q` -1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) and the user will be re-added with the proper permissions. -1. Change to the `gitlab-consul` user - `su - gitlab-consul` -1. Try the check command again - `gitlab-ctl repmgr-check-master`. - -Now there should not be errors. If errors still occur then there is another problem. - -### PgBouncer error `ERROR: pgbouncer cannot connect to server` - -You may get this error when running `gitlab-rake gitlab:db:configure` or you -may see the error in the PgBouncer log file. - -```plaintext -PG::ConnectionBad: ERROR: pgbouncer cannot connect to server -``` - -The problem may be that your PgBouncer node's IP address is not included in the -`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes. - -You can confirm that this is the issue by checking the PostgreSQL log on the master -database node. If you see the following error then `trust_auth_cidr_addresses` -is the problem. - -```plaintext -2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off -``` - -To fix the problem, add the IP address to `/etc/gitlab/gitlab.rb`. - -```ruby -postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>) -``` - -[Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. - -### Issues with other components - -If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of their specific documentation page: - -- [Consul](../consul.md#troubleshooting-consul) -- [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting) - ## Patroni NOTE: -Starting from GitLab 13.1, Patroni is available for **experimental** use to replace repmgr. Due to its -experimental nature, Patroni support is **subject to change without notice.** +Using Patroni instead of Repmgr is supported for PostgreSQL 11 and required for PostgreSQL 12. Patroni is an opinionated solution for PostgreSQL high-availability. It takes the control of PostgreSQL, overrides its configuration and manages its lifecycle (start, stop, restart). This is a more active approach when compared to repmgr. @@ -1174,80 +741,44 @@ functional or does not have a leader, Patroni and by extension PostgreSQL will n API which can be accessed via its [default port](https://docs.gitlab.com/omnibus/package-information/defaults.html#patroni) on each node. -### Configuring Patroni cluster - -You must enable Patroni explicitly to be able to use it (with `patroni['enable'] = true`). When Patroni is enabled -repmgr will be disabled automatically. +### Selecting the appropriate Patroni replication method -Any PostgreSQL configuration item that controls replication, for example `wal_level`, `max_wal_senders`, etc, are strictly -controlled by Patroni and will override the original settings that you make with the `postgresql[...]` configuration key. -Hence, they are all separated and placed under `patroni['postgresql'][...]`. This behavior is limited to replication. -Patroni honours any other PostgreSQL configuration that was made with the `postgresql[...]` configuration key. For example, -`max_wal_senders` by default is set to `5`. 
If you wish to change this you must set it with the `patroni['postgresql']['max_wal_senders']` -configuration key. +[Review the Patroni documentation carefully](https://patroni.readthedocs.io/en/latest/SETTINGS.html#postgresql) +before making changes as **_some of the options carry a risk of potential data +loss if not fully understood_**. The [replication mode](https://patroni.readthedocs.io/en/latest/replication_modes.html) +configured determines the amount of tolerable data loss. -The configuration of Patroni node is very similar to a repmgr but shorter. When Patroni is enabled, first you can ignore -any replication setting of PostgreSQL (it will be overwritten anyway). Then you can remove any `repmgr[...]` or -repmgr-specific configuration as well. Especially, make sure that you remove `postgresql['shared_preload_libraries'] = 'repmgr_funcs'`. +WARNING: +Replication is not a backup strategy! There is no replacement for a well-considered and tested backup solution. -Here is an example similar to [the one that was done with repmgr](#configuring-the-database-nodes): +Omnibus GitLab defaults [`synchronous_commit`](https://www.postgresql.org/docs/11/runtime-config-wal.html#GUC-SYNCHRONOUS-COMMIT) to `on`. ```ruby -# Disable all components except PostgreSQL and Repmgr and Consul -roles['postgres_role'] +postgresql['synchronous_commit'] = 'on' +gitlab['geo-postgresql']['synchronous_commit'] = 'on' +``` -# Enable Patroni -patroni['enable'] = true +#### Customizing Patroni failover behavior -# PostgreSQL configuration -postgresql['listen_address'] = '0.0.0.0' +Omnibus GitLab exposes several options allowing more control over the [Patroni restoration process](#recovering-the-patroni-cluster). -# Disable automatic database migrations -gitlab_rails['auto_migrate'] = false +Each option is shown below with its default value in `/etc/gitlab/gitlab.rb`. -# Configure the Consul agent -consul['services'] = %w(postgresql) - -# START user configuration -# Please set the real values as explained in Required Information section -# -# Replace PGBOUNCER_PASSWORD_HASH with a generated md5 value -postgresql['pgbouncer_user_password'] = 'PGBOUNCER_PASSWORD_HASH' -# Replace POSTGRESQL_PASSWORD_HASH with a generated md5 value -postgresql['sql_user_password'] = 'POSTGRESQL_PASSWORD_HASH' - -# Replace X with value of number of db nodes + 1 (OPTIONAL the default value is 5) -patroni['postgresql']['max_wal_senders'] = X -patroni['postgresql']['max_replication_slots'] = X - -# Replace XXX.XXX.XXX.XXX/YY with Network Address -postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY) - -# Replace placeholders: -# -# Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z -# with the addresses gathered for CONSUL_SERVER_NODES -consul['configuration'] = { - retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z) -} -# -# END user configuration +```ruby +patroni['use_pg_rewind'] = true +patroni['remove_data_directory_on_rewind_failure'] = false +patroni['remove_data_directory_on_diverged_timelines'] = false ``` -You do not need an additional or different configuration for replica nodes. As a matter of fact, you don't have to have -a predetermined primary node. Therefore all database nodes use the same configuration. +[The upstream documentation will always be more up to date](https://patroni.readthedocs.io/en/latest/SETTINGS.html#postgresql), but the table below should provide a minimal overview of functionality. 
-Once the configuration of a node is done, you must [reconfigure Omnibus GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) -on each node for the changes to take effect. +|Setting|Overview| +|-|-| +|`use_pg_rewind`|Try running `pg_rewind` on the former cluster leader before it rejoins the database cluster.| +|`remove_data_directory_on_rewind_failure`|If `pg_rewind` fails, remove the local PostgreSQL data directory and re-replicate from the current cluster leader.| +|`remove_data_directory_on_diverged_timelines`|If `pg_rewind` cannot be used and the former leader's timeline has diverged from the current one, then delete the local data directory and re-replicate from the current cluster leader.| -Generally, when Consul cluster is ready, the first node that [reconfigures](../restart_gitlab.md#omnibus-gitlab-reconfigure) -becomes the leader. You do not need to sequence the nodes reconfiguration. You can run them in parallel or in any order. -If you choose an arbitrary order you do not have any predetermined master. - -As opposed to repmgr, once the nodes are reconfigured you do not need any further action or additional command to join -the replicas. - -#### Database authorization for Patroni +### Database authorization for Patroni Patroni uses Unix socket to manage PostgreSQL instance. Therefore, the connection from the `local` socket must be trusted. @@ -1297,6 +828,16 @@ a manual one, where you have two slightly different options: For further details on this subject, see the [Patroni documentation](https://patroni.readthedocs.io/en/latest/rest_api.html#switchover-and-failover-endpoints). +#### Geo secondary site considerations + +Similar to `repmgr`, when a Geo secondary site is replicating from a primary site that uses `Patroni` and `PgBouncer`, [replicating through PgBouncer is not supported](https://github.com/pgbouncer/pgbouncer/issues/382#issuecomment-517911529) and the secondary must replicate directly from the leader node in the `Patroni` cluster. Therefore, when there is an automatic or manual failover in the `Patroni` cluster, you will need to manually re-point your secondary site to replicate from the new leader with: + +```shell +sudo gitlab-ctl replicate-geo-database --host=<new_leader_ip> --replication-slot=<slot_name> +``` + +Otherwise, the replication will not happen anymore, even if the original node gets re-added as a follower node. This will re-sync your secondary site database and may take a long time depending on the amount of data to sync. You may also need to run `gitlab-ctl reconfigure` if replication is still not working after re-syncing. + ### Recovering the Patroni cluster To recover the old primary and rejoin it to the cluster as a replica, you can simply start Patroni with: @@ -1363,9 +904,9 @@ You can switch an exiting database cluster to use Patroni instead of repmgr with ### Upgrading PostgreSQL major version in a Patroni cluster -As of GitLab 13.3, PostgreSQL 11.7 and 12.3 are both shipped with Omnibus GitLab. GitLab still -uses PostgreSQL 11 by default. Therefore `gitlab-ctl pg-upgrade` does not automatically upgrade -to PostgreSQL 12. If you want to upgrade to PostgreSQL 12, you must ask for it explicitly. +As of GitLab 13.3, PostgreSQL 11.7 and 12.3 are both shipped with Omnibus GitLab, and as of GitLab 13.7 +PostgreSQL 12 is used by default. If you want to upgrade to PostgreSQL 12 in versions prior to GitLab 13.7, +you must ask for it explicitly. 
WARNING: The procedure for upgrading PostgreSQL in a Patroni cluster is different than when upgrading using repmgr. @@ -1450,3 +991,448 @@ NOTE: Reverting PostgreSQL upgrade with `gitlab-ctl revert-pg-upgrade` has the same considerations as `gitlab-ctl pg-upgrade`. You should follow the same procedure by first stopping the replicas, then reverting the leader, and finally reverting the replicas. + +## Repmgr + +NOTE: +Using Patroni instead of Repmgr is supported for PostgreSQL 11 and required for PostgreSQL 12. + +### Configuring Repmgr Nodes + +1. On the master database node, edit `/etc/gitlab/gitlab.rb` replacing values noted in the `# START user configuration` section: + + ```ruby + # Disable all components except PostgreSQL and Repmgr and Consul + roles ['postgres_role'] + + # PostgreSQL configuration + postgresql['listen_address'] = '0.0.0.0' + postgresql['hot_standby'] = 'on' + postgresql['wal_level'] = 'replica' + postgresql['shared_preload_libraries'] = 'repmgr_funcs' + + # Disable automatic database migrations + gitlab_rails['auto_migrate'] = false + + # Configure the Consul agent + consul['services'] = %w(postgresql) + + # START user configuration + # Please set the real values as explained in Required Information section + # + # Replace PGBOUNCER_PASSWORD_HASH with a generated md5 value + postgresql['pgbouncer_user_password'] = 'PGBOUNCER_PASSWORD_HASH' + # Replace POSTGRESQL_PASSWORD_HASH with a generated md5 value + postgresql['sql_user_password'] = 'POSTGRESQL_PASSWORD_HASH' + # Replace X with value of number of db nodes + 1 + postgresql['max_wal_senders'] = X + postgresql['max_replication_slots'] = X + + # Replace XXX.XXX.XXX.XXX/YY with Network Address + postgresql['trust_auth_cidr_addresses'] = %w(XXX.XXX.XXX.XXX/YY) + repmgr['trust_auth_cidr_addresses'] = %w(127.0.0.1/32 XXX.XXX.XXX.XXX/YY) + + # Replace placeholders: + # + # Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z + # with the addresses gathered for CONSUL_SERVER_NODES + consul['configuration'] = { + retry_join: %w(Y.Y.Y.Y consul1.gitlab.example.com Z.Z.Z.Z) + } + # + # END user configuration + ``` + + > `postgres_role` was introduced with GitLab 10.3 + +1. On secondary nodes, add all the configuration specified above for primary node + to `/etc/gitlab/gitlab.rb`. In addition, append the following configuration + to inform `gitlab-ctl` that they are standby nodes initially and it need not + attempt to register them as primary node + + ```ruby + # Specify if a node should attempt to be master on initialization + repmgr['master_on_initialization'] = false + ``` + +1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. +1. [Enable Monitoring](#enable-monitoring) + +> Please note: +> +> - If you want your database to listen on a specific interface, change the configuration: +> `postgresql['listen_address'] = '0.0.0.0'`. +> - If your PgBouncer service runs under a different user account, +> you also need to specify: `postgresql['pgbouncer_user'] = PGBOUNCER_USERNAME` in +> your configuration. + +#### Database nodes post-configuration + +##### Primary node + +Select one node as a primary node. + +1. Open a database prompt: + + ```shell + gitlab-psql -d gitlabhq_production + ``` + +1. Enable the `pg_trgm` extension: + + ```shell + CREATE EXTENSION pg_trgm; + ``` + +1. Enable the `btree_gist` extension: + + ```shell + CREATE EXTENSION btree_gist; + ``` + +1. Exit the database prompt by typing `\q` and Enter. + +1. 
Verify the cluster is initialized with one node: + + ```shell + gitlab-ctl repmgr cluster show + ``` + + The output should be similar to the following: + + ```plaintext + Role | Name | Upstream | Connection String + ----------+----------|----------|---------------------------------------- + * master | HOSTNAME | | host=HOSTNAME user=gitlab_repmgr dbname=gitlab_repmgr + ``` + +1. Note down the hostname or IP address in the connection string: `host=HOSTNAME`. We will + refer to the hostname in the next section as `MASTER_NODE_NAME`. If the value + is not an IP address, it will need to be a resolvable name (via DNS or + `/etc/hosts`) + +##### Secondary nodes + +1. Set up the repmgr standby: + + ```shell + gitlab-ctl repmgr standby setup MASTER_NODE_NAME + ``` + + Do note that this will remove the existing data on the node. The command + has a wait time. + + The output should be similar to the following: + + ```console + # gitlab-ctl repmgr standby setup MASTER_NODE_NAME + Doing this will delete the entire contents of /var/opt/gitlab/postgresql/data + If this is not what you want, hit Ctrl-C now to exit + To skip waiting, rerun with the -w option + Sleeping for 30 seconds + Stopping the database + Removing the data + Cloning the data + Starting the database + Registering the node with the cluster + ok: run: repmgrd: (pid 19068) 0s + ``` + +1. Verify the node now appears in the cluster: + + ```shell + gitlab-ctl repmgr cluster show + ``` + + The output should be similar to the following: + + ```plaintext + Role | Name | Upstream | Connection String + ----------+---------|-----------|------------------------------------------------ + * master | MASTER | | host=MASTER_NODE_NAME user=gitlab_repmgr dbname=gitlab_repmgr + standby | STANDBY | MASTER | host=STANDBY_HOSTNAME user=gitlab_repmgr dbname=gitlab_repmgr + ``` + +Repeat the above steps on all secondary nodes. + +#### Database checkpoint + +Before moving on, make sure the databases are configured correctly. Run the +following command on the **primary** node to verify that replication is working +properly: + +```shell +gitlab-ctl repmgr cluster show +``` + +The output should be similar to: + +```plaintext +Role | Name | Upstream | Connection String +----------+--------------|--------------|-------------------------------------------------------------------- +* master | MASTER | | host=MASTER port=5432 user=gitlab_repmgr dbname=gitlab_repmgr + standby | STANDBY | MASTER | host=STANDBY port=5432 user=gitlab_repmgr dbname=gitlab_repmgr +``` + +If the 'Role' column for any node says "FAILED", check the +[Troubleshooting section](#troubleshooting) before proceeding. + +Also, check that the check master command works successfully on each node: + +```shell +su - gitlab-consul +gitlab-ctl repmgr-check-master || echo 'This node is a standby repmgr node' +``` + +This command relies on exit codes to tell Consul whether a particular node is a master +or secondary. The most important thing here is that this command does not produce errors. +If there are errors it's most likely due to incorrect `gitlab-consul` database user permissions. +Check the [Troubleshooting section](#troubleshooting) before proceeding. + +### Repmgr failover procedure + +By default, if the master database fails, `repmgrd` should promote one of the +standby nodes to master automatically, and Consul will update PgBouncer with +the new master. 
+ +If you need to failover manually, you have two options: + +**Shutdown the current master database** + +Run: + +```shell +gitlab-ctl stop postgresql +``` + +The automated failover process will see this and failover to one of the +standby nodes. + +**Or perform a manual failover** + +1. Ensure the old master node is not still active. +1. Login to the server that should become the new master and run: + + ```shell + gitlab-ctl repmgr standby promote + ``` + +1. If there are any other standby servers in the cluster, have them follow + the new master server: + + ```shell + gitlab-ctl repmgr standby follow NEW_MASTER + ``` + +#### Geo secondary site considerations + +When a Geo secondary site is replicating from a primary site that uses `repmgr` and `PgBouncer`, [replicating through PgBouncer is not supported](https://github.com/pgbouncer/pgbouncer/issues/382#issuecomment-517911529) and the secondary must replicate directly from the leader node in the `repmgr` cluster. Therefore, when there is a failover in the `repmgr` cluster, you will need to manually re-point your secondary site to replicate from the new leader with: + +```shell +sudo gitlab-ctl replicate-geo-database --host=<new_leader_ip> --replication-slot=<slot_name> +``` + +Otherwise, the replication will not happen anymore, even if the original node gets re-added as a follower node. This will re-sync your secondary site database and may take a long time depending on the amount of data to sync. You may also need to run `gitlab-ctl reconfigure` if replication is still not working after re-syncing. + +### Repmgr Restore procedure + +If a node fails, it can be removed from the cluster, or added back as a standby +after it has been restored to service. + +#### Remove a standby from the cluster + + From any other node in the cluster, run: + + ```shell + gitlab-ctl repmgr standby unregister --node=X + ``` + + where X is the value of node in `repmgr.conf` on the old server. + + To find this, you can use: + + ```shell + awk -F = '$1 == "node" { print $2 }' /var/opt/gitlab/postgresql/repmgr.conf + ``` + + It will output something like: + + ```plaintext + 959789412 + ``` + + Then you will use this ID to unregister the node: + + ```shell + gitlab-ctl repmgr standby unregister --node=959789412 + ``` + +#### Add a node as a standby server + + From the standby node, run: + + ```shell + gitlab-ctl repmgr standby follow NEW_MASTER + gitlab-ctl restart repmgrd + ``` + + WARNING: + When the server is brought back online, and before + you switch it to a standby node, repmgr will report that there are two masters. + If there are any clients that are still attempting to write to the old master, + this will cause a split, and the old master will need to be resynced from + scratch by performing a `gitlab-ctl repmgr standby setup NEW_MASTER`. + +#### Add a failed master back into the cluster as a standby node + + Once `repmgrd` and PostgreSQL are running, the node will need to follow the new + as a standby node. + + ```shell + gitlab-ctl repmgr standby follow NEW_MASTER + ``` + + Once the node is following the new master as a standby, the node needs to be + [unregistered from the cluster on the new master node](#remove-a-standby-from-the-cluster). + + Once the old master node has been unregistered from the cluster, it will need + to be setup as a new standby: + + ```shell + gitlab-ctl repmgr standby setup NEW_MASTER + ``` + + Failure to unregister and read the old master node can lead to subsequent failovers + not working. 
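Putting the steps above together, a typical sequence for returning a failed master to service looks like the following sketch, where `NEW_MASTER` and `NODE_ID` are placeholders for your own values:

```shell
# On the recovered old master: follow the new master.
gitlab-ctl repmgr standby follow NEW_MASTER

# On the new master: unregister the old master's stale registration,
# using the node ID from the old server's repmgr.conf.
gitlab-ctl repmgr standby unregister --node=NODE_ID

# On the old master: wipe its data directory and re-clone it as a fresh standby.
gitlab-ctl repmgr standby setup NEW_MASTER
```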
+ +### Alternate configurations + +#### Database authorization + +By default, we give any host on the database network the permission to perform +repmgr operations using PostgreSQL's `trust` method. If you do not want this +level of trust, there are alternatives. + +You can trust only the specific nodes that will be database clusters, or you +can require md5 authentication. + +#### Trust specific addresses + +If you know the IP address, or FQDN of all database and PgBouncer nodes in the +cluster, you can trust only those nodes. + +In `/etc/gitlab/gitlab.rb` on all of the database nodes, set +`repmgr['trust_auth_cidr_addresses']` to an array of strings containing all of +the addresses. + +If setting to a node's FQDN, they must have a corresponding PTR record in DNS. +If setting to a node's IP address, specify it as `XXX.XXX.XXX.XXX/32`. + +For example: + +```ruby +repmgr['trust_auth_cidr_addresses'] = %w(192.168.1.44/32 db2.example.com) +``` + +#### MD5 Authentication + +If you are running on an untrusted network, repmgr can use md5 authentication +with a [`.pgpass` file](https://www.postgresql.org/docs/11/libpq-pgpass.html) +to authenticate. + +You can specify by IP address, FQDN, or by subnet, using the same format as in +the previous section: + +1. On the current master node, create a password for the `gitlab` and + `gitlab_repmgr` user: + + ```shell + gitlab-psql -d template1 + template1=# \password gitlab_repmgr + Enter password: **** + Confirm password: **** + template1=# \password gitlab + ``` + +1. On each database node: + + 1. Edit `/etc/gitlab/gitlab.rb`: + 1. Ensure `repmgr['trust_auth_cidr_addresses']` is **not** set + 1. Set `postgresql['md5_auth_cidr_addresses']` to the desired value + 1. Set `postgresql['sql_replication_user'] = 'gitlab_repmgr'` + 1. Reconfigure with `gitlab-ctl reconfigure` + 1. Restart PostgreSQL with `gitlab-ctl restart postgresql` + + 1. Create a `.pgpass` file. Enter the `gitlab_repmgr` password twice to + when asked: + + ```shell + gitlab-ctl write-pgpass --user gitlab_repmgr --hostuser gitlab-psql --database '*' + ``` + +1. On each PgBouncer node, edit `/etc/gitlab/gitlab.rb`: + 1. Ensure `gitlab_rails['db_password']` is set to the plaintext password for + the `gitlab` database user + 1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect + +## Troubleshooting + +### Consul and PostgreSQL changes not taking effect + +Due to the potential impacts, `gitlab-ctl reconfigure` only reloads Consul and PostgreSQL, it will not restart the services. However, not all changes can be activated by reloading. + +To restart either service, run `gitlab-ctl restart SERVICE` + +For PostgreSQL, it is usually safe to restart the master node by default. Automatic failover defaults to a 1 minute timeout. Provided the database returns before then, nothing else needs to be done. To be safe, you can stop `repmgrd` on the standby nodes first with `gitlab-ctl stop repmgrd`, then start afterwards with `gitlab-ctl start repmgrd`. + +On the Consul server nodes, it is important to [restart the Consul service](../consul.md#restart-consul) in a controlled manner. + +### `gitlab-ctl repmgr-check-master` command produces errors + +If this command displays errors about database permissions it is likely that something failed during +install, resulting in the `gitlab-consul` database user getting incorrect permissions. Follow these +steps to fix the problem: + +1. 
On the master database node, connect to the database prompt - `gitlab-psql -d template1` +1. Delete the `gitlab-consul` user - `DROP USER "gitlab-consul";` +1. Exit the database prompt - `\q` +1. [Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) and the user will be re-added with the proper permissions. +1. Change to the `gitlab-consul` user - `su - gitlab-consul` +1. Try the check command again - `gitlab-ctl repmgr-check-master`. + +Now there should not be errors. If errors still occur then there is another problem. + +### PgBouncer error `ERROR: pgbouncer cannot connect to server` + +You may get this error when running `gitlab-rake gitlab:db:configure` or you +may see the error in the PgBouncer log file. + +```plaintext +PG::ConnectionBad: ERROR: pgbouncer cannot connect to server +``` + +The problem may be that your PgBouncer node's IP address is not included in the +`trust_auth_cidr_addresses` setting in `/etc/gitlab/gitlab.rb` on the database nodes. + +You can confirm that this is the issue by checking the PostgreSQL log on the master +database node. If you see the following error then `trust_auth_cidr_addresses` +is the problem. + +```plaintext +2018-03-29_13:59:12.11776 FATAL: no pg_hba.conf entry for host "123.123.123.123", user "pgbouncer", database "gitlabhq_production", SSL off +``` + +To fix the problem, add the IP address to `/etc/gitlab/gitlab.rb`. + +```ruby +postgresql['trust_auth_cidr_addresses'] = %w(123.123.123.123/32 <other_cidrs>) +``` + +[Reconfigure GitLab](../restart_gitlab.md#omnibus-gitlab-reconfigure) for the changes to take effect. + +### Issues with other components + +If you're running into an issue with a component not outlined here, be sure to check the troubleshooting section of their specific documentation page: + +- [Consul](../consul.md#troubleshooting-consul) +- [PostgreSQL](https://docs.gitlab.com/omnibus/settings/database.html#troubleshooting) |