summaryrefslogtreecommitdiff
path: root/doc/administration/high_availability/consul.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/administration/high_availability/consul.md')
-rw-r--r--doc/administration/high_availability/consul.md105
1 files changed, 105 insertions, 0 deletions
diff --git a/doc/administration/high_availability/consul.md b/doc/administration/high_availability/consul.md
new file mode 100644
index 00000000000..056b7fc15d9
--- /dev/null
+++ b/doc/administration/high_availability/consul.md
@@ -0,0 +1,105 @@
+# Working with the bundled Consul service **[PREMIUM ONLY]**
+
+## Overview
+
+As part of its High Availability stack, GitLab Premium includes a bundled version of [Consul](http://consul.io) that can be managed through `/etc/gitlab/gitlab.rb`.
+
+A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the consul cluster.
+
+## Operations
+
+### Checking cluster membership
+
+To see which nodes are part of the cluster, run the following on any member in the cluster
+```
+# /opt/gitlab/embedded/bin/consul members
+Node Address Status Type Build Protocol DC
+consul-b XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
+consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
+consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul
+db-a XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul
+db-b XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul
+```
+
+Ideally all nodes will have a `Status` of `alive`.
+
+### Restarting the server cluster
+
+**Note**: This section only applies to server agents. It is safe to restart client agents whenever needed.
+
+If it is necessary to restart the server cluster, it is important to do this in a controlled fashion in order to maintain quorum. If quorum is lost, you will need to follow the consul [outage recovery](#outage-recovery) process to recover the cluster.
+
+To be safe, we recommend you only restart one server agent at a time to ensure the cluster remains intact.
+
+For larger clusters, it is possible to restart multiple agents at a time. See the [Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) for how many failures it can tolerate. This will be the number of simulateneous restarts it can sustain.
+
+## Troubleshooting
+
+### Consul server agents unable to communicate
+
+By default, the server agents will attempt to [bind](https://www.consul.io/docs/agent/options.html#_bind) to '0.0.0.0', but they will advertise the first private IP address on the node for other agents to communicate with them. If the other nodes cannot communicate with a node on this address, then the cluster will have a failed status.
+
+You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
+
+```
+2017-09-25_19:53:39.90821 2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election
+2017-09-25_19:53:41.74356 2017/09/25 19:53:41 [ERR] agent: failed to sync remote state: No cluster leader
+```
+
+
+To fix this:
+
+1. Pick an address on each node that all of the other nodes can reach this node through.
+1. Update your `/etc/gitlab/gitlab.rb`
+
+ ```ruby
+ consul['configuration'] = {
+ ...
+ bind_addr: 'IP ADDRESS'
+ }
+ ```
+1. Run `gitlab-ctl reconfigure`
+
+If you still see the errors, you may have to [erase the consul database and reinitialize](#recreate-from-scratch) on the affected node.
+
+### Consul agents do not start - Multiple private IPs
+
+In the case that a node has multiple private IPs the agent be confused as to which of the private addresses to advertise, and then immediately exit on start.
+
+You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue:
+
+```
+2017-11-09_17:41:45.52876 ==> Starting Consul agent...
+2017-11-09_17:41:45.53057 ==> Error creating agent: Failed to get advertise address: Multiple private IPs found. Please configure one.
+```
+
+To fix this:
+
+1. Pick an address on the node that all of the other nodes can reach this node through.
+1. Update your `/etc/gitlab/gitlab.rb`
+
+ ```ruby
+ consul['configuration'] = {
+ ...
+ bind_addr: 'IP ADDRESS'
+ }
+ ```
+1. Run `gitlab-ctl reconfigure`
+
+### Outage recovery
+
+If you lost enough server agents in the cluster to break quorum, then the cluster is considered failed, and it will not function without manual intervenetion.
+
+#### Recreate from scratch
+By default, GitLab does not store anything in the consul cluster that cannot be recreated. To erase the consul database and reinitialize
+
+```
+# gitlab-ctl stop consul
+# rm -rf /var/opt/gitlab/consul/data
+# gitlab-ctl start consul
+```
+
+After this, the cluster should start back up, and the server agents rejoin. Shortly after that, the client agents should rejoin as well.
+
+#### Recover a failed cluster
+If you have taken advantage of consul to store other data, and want to restore the failed cluster, please follow the [Consul guide](https://www.consul.io/docs/guides/outage.html) to recover a failed cluster.