author | Achilleas Pipinellis <axil@gitlab.com> | 2019-06-21 10:33:38 +0000 |
---|---|---|
committer | Achilleas Pipinellis <axil@gitlab.com> | 2019-06-21 10:33:38 +0000 |
commit | c10bde1ff088d0b744ce98b28ee6faa16b0eda34 (patch) | |
tree | e9351bbc43bbd56a9be87ffc8401105317289420 | |
parent | a53fc9f1cf04609a1f7a1704568111d2a89684a4 (diff) | |
parent | b765b8e3b2fd4565cf48cd883c18d13893837364 (diff) | |
Merge branch 'docs/add-another-fdw-troubleshoot-item' into 'master'
Add new troubleshooting step and refactor Geo replication docs
Closes #63210
See merge request gitlab-org/gitlab-ce!29783
-rw-r--r-- | doc/administration/geo/replication/troubleshooting.md | 108 |
1 files changed, 76 insertions, 32 deletions
````diff
diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
index c5bdd36ba70..846afd8f5f4 100644
--- a/doc/administration/geo/replication/troubleshooting.md
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -1,15 +1,23 @@
 # Geo Troubleshooting **[PREMIUM ONLY]**
 
-NOTE: **Note:**
-This list is an attempt to document all the moving parts that can go wrong.
-We are working into getting all this steps verified automatically in a
-rake task in the future.
-
 Setting up Geo requires careful attention to details and sometimes it's easy to
-miss a step. Here is a list of questions you should ask to try to detect
-what you need to fix (all commands and path locations are for Omnibus installs):
+miss a step.
+
+Here is a list of steps you should take to attempt to fix problem:
+
+- Perform [basic troubleshooting](#basic-troubleshooting).
+- Fix any [replication errors](#fixing-replication-errors).
+- Fix any [Foreign Data Wrapper](#fixing-foreign-data-wrapper-errors) errors.
+- Fix any [common](#fixing-common-errors) errors.
 
-## First check the health of the **secondary** node
+## Basic troubleshooting
+
+Before attempting more advanced troubleshooting:
+
+- Check [the health of the **secondary** node](#check-the-health-of-the-secondary-node).
+- Check [if PostgreSQL replication is working](#check-if-postgresql-replication-is-working).
+
+### Check the health of the **secondary** node
 
 Visit the **primary** node's **Admin Area > Geo** (`/admin/geo/nodes`) in
 your browser. We perform the following health checks on each **secondary** node
@@ -23,10 +31,12 @@ to help identify if something is wrong:
 
 ![Geo health check](img/geo_node_healthcheck.png)
 
-For information on how to resolve common errors reported from the UI, see [common errors](#common-errors).
+For information on how to resolve common errors reported from the UI, see
+[Fixing Common Errors](#fixing-common-errors).
 
 If the UI is not working, or you are unable to log in, you can run the Geo
 health check manually to get this information as well as a few more details.
+
 This rake task can be run on an app node in the **primary** or **secondary**
 Geo nodes:
 
@@ -36,7 +46,7 @@ sudo gitlab-rake gitlab:geo:check
 
 Example output:
 
-```
+```text
 Checking Geo ...
 
 GitLab Geo is available ... yes
@@ -68,7 +78,7 @@ sudo gitlab-rake geo:status
 
 Example output:
 
-```
+```text
 http://secondary.example.com/
 -----------------------------------------------------
 GitLab Version: 11.10.4-ee
@@ -89,16 +99,21 @@ http://secondary.example.com/
 Last status report was: 2 minutes ago
 ```
 
-## Is Postgres replication working?
+### Check if PostgreSQL replication is working
+
+To check if PostgreSQL replication is working, check if:
+
+- [Nodes are pointing to the correct database instance](#are-nodes-pointing-to-the-correct-database-instance).
+- [Geo can detect the current node correctly](#can-geo-detect-the-current-node-correctly).
 
-### Are my nodes pointing to the correct database instance?
+#### Are nodes pointing to the correct database instance?
 
 You should make sure your **primary** Geo node points to
 the instance with writing permissions.
 
 Any **secondary** nodes should point only to read-only instances.
 
-### Can Geo detect my current node correctly?
+#### Can Geo detect the current node correctly?
 
 Geo uses the defined node from the **Admin Area > Geo** screen, and tries to match
 it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file.
@@ -112,29 +127,38 @@ sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
 
 and expect something like:
 
-```
+```ruby
 #<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
 ```
 
 By running the command above, `primary` should be `true` when executed in
 the **primary** node, and `false` on any **secondary** node.
 
-## How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
+## Fixing replication errors
+
+The following sections outline troubleshooting steps for fixing replication
+errors.
+
+### Message: "ERROR: replication slots can only be used if max_replication_slots > 0"?
 
 This means that the `max_replication_slots` PostgreSQL variable needs to
 be set on the **primary** database. In GitLab 9.4, we have made this setting
 default to 1. You may need to increase this value if you have more
-**secondary** nodes. Be sure to restart PostgreSQL for this to take
+**secondary** nodes.
+
+Be sure to restart PostgreSQL for this to take
 effect. See the [PostgreSQL replication setup][database-pg-replication] guide for more
 details.
 
-## How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
+### Message: "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
 
 This occurs when PostgreSQL does not have a replication slot for the
-**secondary** node by that name. You may want to rerun the [replication
+**secondary** node by that name.
+
+You may want to rerun the [replication
 process](database.md) on the **secondary** node .
 
-## How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
+### Message: "Command exceeded allowed execution time" when setting up replication?
 
 This may happen while [initiating the replication process][database-start-replication] on the **secondary** node,
 and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes).
@@ -153,7 +177,7 @@ sudo gitlab-ctl \
 This will give the initial replication up to six hours to complete, rather than
 the default thirty minutes. Adjust as required for your installation.
 
-## How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
+### Message: "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
 
 Determine if you have any unused replication slots in the **primary** database. This can cause large amounts of
 log data to build up in `pg_xlog`. Removing the unused slots can reduce the amount of space used in the `pg_xlog`.
@@ -184,11 +208,12 @@ Slots where `active` is `f` are not active.
 SELECT pg_drop_replication_slot('<name_of_extra_slot>');
 ```
 
-## Very large repositories never successfully synchronize on the **secondary** node
+### Very large repositories never successfully synchronize on the **secondary** node
 
 GitLab places a timeout on all repository clones, including project imports
 and Geo synchronization operations. If a fresh `git clone` of a repository
 on the primary takes more than a few minutes, you may be affected by this.
+
 To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
 on the **secondary** node:
 
@@ -205,7 +230,7 @@ sudo gitlab-ctl reconfigure
 This will increase the timeout to three hours (10800 seconds). Choose a time
 long enough to accommodate a full clone of your largest repositories.
 
-## How to reset Geo **secondary** node replication
+### Reseting Geo **secondary** node replication
 
 If you get a **secondary** node in a broken state and want to reset the replication state,
 to start again from scratch, there are a few steps that can help you:
@@ -289,12 +314,16 @@ to start again from scratch, there are a few steps that can help you:
 gitlab-ctl start
 ```
 
-## How do I fix a "Foreign Data Wrapper (FDW) is not configured" error?
+## Fixing Foreign Data Wrapper errors
+
+This section documents ways to fix potential Foreign Data Wrapper errors.
+
+### "Foreign Data Wrapper (FDW) is not configured" error
 
 When setting up Geo, you might see this warning in the `gitlab-rake
 gitlab:geo:check` output:
 
-```
+```text
 GitLab Geo tracking database Foreign Data Wrapper schema is up-to-date? ... foreign data wrapper is not configured
 ```
 
@@ -307,7 +336,7 @@ There are a few key points to remember:
 By default, the Geo secondary and tracking database are running on the same host on different ports.
 That is, 5432 and 5431 respectively.
 
-### Checking configuration
+#### Checking configuration
 
 NOTE: **Note:**
 The following steps are for Omnibus installs only. Using Geo with source-based installs was **deprecated** in GitLab 11.5.
@@ -419,7 +448,7 @@ should see something like this:
 - `geo_postgresql['fdw_external_user']`
 - `geo_postgresql['fdw_external_password']`
 
-### Manual reload of FDW schema
+#### Manual reload of FDW schema
 
 If you're still unable to get FDW working, you may want to try a manual
 reload of the FDW schema. To manually reload the FDW schema:
@@ -459,9 +488,25 @@ reload of the FDW schema. To manually reload the FDW schema:
 [database-start-replication]: database.md#step-3-initiate-the-replication-process
 [database-pg-replication]: database.md#postgresql-replication
 
-## Common errors
+### "Geo database has an outdated FDW remote schema" error
+
+GitLab can error with a `Geo database has an outdated FDW remote schema` message.
+
+For example:
 
-This section documents common errors reported in the admin UI and how to fix them.
+```text
+Geo database has an outdated FDW remote schema. It contains 229 of 236 expected tables. Please refer to Geo Troubleshooting.
+```
+
+To resolve this, run the following command:
+
+```sh
+sudo gitlab-rake geo:db:refresh_foreign_tables
+```
+
+## Fixing common errors
+
+This section documents common errors reported in the Admin UI and how to fix them.
 
 ### Geo database configuration file is missing
 
@@ -470,7 +515,6 @@ GitLab cannot find or doesn't have permission to access the `database_geo.yml` c
 In an Omnibus GitLab installation, the file should be in `/var/opt/gitlab/gitlab-rails/etc`.
 If it doesn't exist or inadvertent changes have been made to it, run `sudo gitlab-ctl reconfigure`
 to restore it to its correct state.
-
 If this path is mounted on a remote volume, please check your volume configuration and that it has correct permissions.
 
 ### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node.
@@ -503,7 +547,7 @@ Make sure you follow the [Geo database replication](database.md) instructions fo
 
 If you are using GitLab Omnibus installation, something might have failed during upgrade. You can:
 
-- Run `sudo gitlab-ctl reconfigure`.
+- Run `sudo gitlab-ctl reconfigure`.
 - Manually trigger the database migration by running: `sudo gitlab-rake geo:db:migrate` as root on the **secondary** node.
 
 ### Geo database is not configured to use Foreign Data Wrapper
@@ -511,4 +555,4 @@ If you are using GitLab Omnibus installation, something might have failed during
 This error means the Geo Tracking Database doesn't have the FDW server and credentials
 configured.
 
-See [How do I fix a "Foreign Data Wrapper (FDW) is not configured" error?](#how-do-i-fix-a-foreign-data-wrapper-fdw-is-not-configured-error).
+See ["Foreign Data Wrapper (FDW) is not configured" error?](#foreign-data-wrapper-fdw-is-not-configured-error).
````
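The revised page pairs several error messages with a specific remedy, most directly in the new "Geo database has an outdated FDW remote schema" section. As a rough illustration of that message-to-remedy lookup, the sketch below maps a captured message to the fix the page names for it. The function name `suggest_geo_fix` is hypothetical and not part of GitLab. It assumes you have already copied the message text by hand, and it only prints a suggestion without running anything.

```sh
#!/bin/sh
# Hypothetical helper (not part of GitLab): print the remedy that the
# troubleshooting page lists for a given Geo error message.
# It only prints text; it never executes the suggested command.
suggest_geo_fix() {
  case "$1" in
    *"outdated FDW remote schema"*)
      echo "sudo gitlab-rake geo:db:refresh_foreign_tables" ;;
    *"max_replication_slots > 0"*)
      echo "set max_replication_slots on the primary database, then restart PostgreSQL" ;;
    *"No space left on device"*)
      echo "check the primary database for unused replication slots" ;;
    *)
      # Unknown message: point back at the manual health check.
      echo "no suggestion: run sudo gitlab-rake gitlab:geo:check" >&2
      return 1 ;;
  esac
}

# Example call, using the sample message from the new section.
suggest_geo_fix "Geo database has an outdated FDW remote schema. It contains 229 of 236 expected tables."
# prints: sudo gitlab-rake geo:db:refresh_foreign_tables
```

Printing the suggestion instead of running it keeps the sketch safe to try on any machine. Whether to run the command on a **secondary** node remains the administrator's decision.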