diff options
Diffstat (limited to 'doc/administration/high_availability/consul.md')
-rw-r--r-- | doc/administration/high_availability/consul.md | 105 |
1 files changed, 105 insertions, 0 deletions
diff --git a/doc/administration/high_availability/consul.md b/doc/administration/high_availability/consul.md new file mode 100644 index 00000000000..056b7fc15d9 --- /dev/null +++ b/doc/administration/high_availability/consul.md @@ -0,0 +1,105 @@ +# Working with the bundled Consul service **[PREMIUM ONLY]** + +## Overview + +As part of its High Availability stack, GitLab Premium includes a bundled version of [Consul](http://consul.io) that can be managed through `/etc/gitlab/gitlab.rb`. + +A Consul cluster consists of multiple server agents, as well as client agents that run on other nodes which need to talk to the consul cluster. + +## Operations + +### Checking cluster membership + +To see which nodes are part of the cluster, run the following on any member in the cluster +``` +# /opt/gitlab/embedded/bin/consul members +Node Address Status Type Build Protocol DC +consul-b XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +consul-c XX.XX.X.Y:8301 alive server 0.9.0 2 gitlab_consul +db-a XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul +db-b XX.XX.X.Y:8301 alive client 0.9.0 2 gitlab_consul +``` + +Ideally all nodes will have a `Status` of `alive`. + +### Restarting the server cluster + +**Note**: This section only applies to server agents. It is safe to restart client agents whenever needed. + +If it is necessary to restart the server cluster, it is important to do this in a controlled fashion in order to maintain quorum. If quorum is lost, you will need to follow the consul [outage recovery](#outage-recovery) process to recover the cluster. + +To be safe, we recommend you only restart one server agent at a time to ensure the cluster remains intact. + +For larger clusters, it is possible to restart multiple agents at a time. See the [Consul consensus document](https://www.consul.io/docs/internals/consensus.html#deployment-table) for how many failures it can tolerate. This will be the number of simulateneous restarts it can sustain. + +## Troubleshooting + +### Consul server agents unable to communicate + +By default, the server agents will attempt to [bind](https://www.consul.io/docs/agent/options.html#_bind) to '0.0.0.0', but they will advertise the first private IP address on the node for other agents to communicate with them. If the other nodes cannot communicate with a node on this address, then the cluster will have a failed status. + +You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue: + +``` +2017-09-25_19:53:39.90821 2017/09/25 19:53:39 [WARN] raft: no known peers, aborting election +2017-09-25_19:53:41.74356 2017/09/25 19:53:41 [ERR] agent: failed to sync remote state: No cluster leader +``` + + +To fix this: + +1. Pick an address on each node that all of the other nodes can reach this node through. +1. Update your `/etc/gitlab/gitlab.rb` + + ```ruby + consul['configuration'] = { + ... + bind_addr: 'IP ADDRESS' + } + ``` +1. Run `gitlab-ctl reconfigure` + +If you still see the errors, you may have to [erase the consul database and reinitialize](#recreate-from-scratch) on the affected node. + +### Consul agents do not start - Multiple private IPs + +In the case that a node has multiple private IPs the agent be confused as to which of the private addresses to advertise, and then immediately exit on start. + +You will see messages like the following in `gitlab-ctl tail consul` output if you are running into this issue: + +``` +2017-11-09_17:41:45.52876 ==> Starting Consul agent... +2017-11-09_17:41:45.53057 ==> Error creating agent: Failed to get advertise address: Multiple private IPs found. Please configure one. +``` + +To fix this: + +1. Pick an address on the node that all of the other nodes can reach this node through. +1. Update your `/etc/gitlab/gitlab.rb` + + ```ruby + consul['configuration'] = { + ... + bind_addr: 'IP ADDRESS' + } + ``` +1. Run `gitlab-ctl reconfigure` + +### Outage recovery + +If you lost enough server agents in the cluster to break quorum, then the cluster is considered failed, and it will not function without manual intervenetion. + +#### Recreate from scratch +By default, GitLab does not store anything in the consul cluster that cannot be recreated. To erase the consul database and reinitialize + +``` +# gitlab-ctl stop consul +# rm -rf /var/opt/gitlab/consul/data +# gitlab-ctl start consul +``` + +After this, the cluster should start back up, and the server agents rejoin. Shortly after that, the client agents should rejoin as well. + +#### Recover a failed cluster +If you have taken advantage of consul to store other data, and want to restore the failed cluster, please follow the [Consul guide](https://www.consul.io/docs/guides/outage.html) to recover a failed cluster. |