Diffstat (limited to 'doc/administration/geo/replication/troubleshooting.md')
-rw-r--r--  doc/administration/geo/replication/troubleshooting.md  384
1 file changed, 384 insertions, 0 deletions
diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
new file mode 100644
index 00000000000..6fea03cc8ec
--- /dev/null
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -0,0 +1,384 @@
+# Geo Troubleshooting
+
+NOTE: **Note:**
+This list is an attempt to document all the moving parts that can go wrong.
+We are working on having all of these steps verified automatically by a
+Rake task in the future.
+
+Setting up Geo requires careful attention to detail, and it's easy to
+miss a step. Here is a list of questions you should ask to detect
+what you need to fix (all commands and path locations are for Omnibus installs):
+
+## First check the health of the **secondary** node
+
+Visit the **primary** node's **Admin Area > Geo** (`/admin/geo/nodes`) in
+your browser. We perform the following health checks on each **secondary** node
+to help identify if something is wrong:
+
+- Is the node running?
+- Is the node's secondary database configured for streaming replication?
+- Is the node's secondary tracking database configured?
+- Is the node's secondary tracking database connected?
+- Is the node's secondary tracking database up-to-date?
+
+![Geo health check](img/geo_node_healthcheck.png)
+
+You can also check the status of the **secondary** node by running a dedicated Rake task:
+
+```sh
+sudo gitlab-rake geo:status
+```
+
+## Is PostgreSQL replication working?
+
+### Are my nodes pointing to the correct database instance?
+
+You should make sure your **primary** Geo node points to the database instance with
+write permissions.
+
+Any **secondary** nodes should point only to read-only instances.
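+
+One quick way to verify which role a given database instance is playing is
+PostgreSQL's built-in `pg_is_in_recovery()` function. This is a generic
+PostgreSQL check, not a Geo-specific one:
+
+```sql
+-- Returns 'f' on the writable (primary) instance and 't' on a read-only replica.
+SELECT pg_is_in_recovery();
+```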
+
+### Can Geo detect my current node correctly?
+
+Geo uses the defined node from the **Admin Area > Geo** screen, and tries to match
+it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file.
+The relevant line looks like: `external_url "http://gitlab.example.com"`.
+
+To check whether the node on the current machine is correctly detected, type:
+
+```sh
+sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
+```
+
+and expect something like:
+
+```
+#<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
+```
+
+In the output of the command above, `primary` should be `true` when executed on
+the **primary** node, and `false` on any **secondary** node.
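+
+You can also confirm that the `external_url` configured on the current machine
+matches the node definition in the Admin Area:
+
+```sh
+# Print the external URL configured for this node.
+sudo grep '^external_url' /etc/gitlab/gitlab.rb
+```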
+
+## How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
+
+This means that the `max_replication_slots` PostgreSQL variable needs to
+be set on the **primary** database. In GitLab 9.4, we have made this setting
+default to 1. You may need to increase this value if you have more
+**secondary** nodes. Be sure to restart PostgreSQL for this to take
+effect. See the [PostgreSQL replication
+setup][database-pg-replication] guide for more details.
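+
+You can verify the current value from a `gitlab-psql` session on the **primary**
+database with `SHOW max_replication_slots;`. On Omnibus installs the setting is
+managed through `/etc/gitlab/gitlab.rb`; a minimal sketch, assuming a single
+**secondary** node:
+
+```ruby
+# /etc/gitlab/gitlab.rb on the primary database node.
+# Allow at least one replication slot per secondary node.
+postgresql['max_replication_slots'] = 1
+```
+
+Run `sudo gitlab-ctl reconfigure` and then restart PostgreSQL for the new value
+to take effect.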
+
+## How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
+
+This occurs when PostgreSQL does not have a replication slot for the
+**secondary** node by that name. You may want to rerun the [replication
+process](database.md) on the **secondary** node.
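+
+You can list the replication slots that do exist on the **primary** database to
+compare against the slot name in the error message:
+
+```sql
+-- Run on the primary database; shows all configured replication slots.
+SELECT slot_name, active FROM pg_replication_slots;
+```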
+
+## How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
+
+This may happen while [initiating the replication process][database-start-replication] on the **secondary** node,
+and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes).
+
+Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
+`--backup-timeout`:
+
+```sh
+sudo gitlab-ctl replicate-geo-database --host=primary.geo.example.com --slot-name=secondary_geo_example_com --backup-timeout=21600
+```
+
+This gives the initial replication up to six hours (21600 seconds) to complete, rather than
+the default thirty minutes. Adjust as required for your installation.
+
+## How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"
+
+Determine if you have any unused replication slots in the **primary** database. Unused slots can cause large amounts of
+log data to build up in `pg_xlog`. Removing them reduces the amount of space used in `pg_xlog`.
+
+1. Start a PostgreSQL console session:
+
+ ```sh
+ sudo gitlab-psql gitlabhq_production
+ ```
+
+ > Note that using `gitlab-rails dbconsole` will not work, because managing replication slots requires superuser permissions.
+
+1. View your replication slots with:
+
+ ```sql
+ SELECT * FROM pg_replication_slots;
+ ```
+
+Slots where `active` is `f` are not active.
+
+- If a slot should be active because you have a **secondary** node configured to use it,
+  log in to that **secondary** node and check the PostgreSQL logs to see why replication is not running.
+
+- If you are no longer using the slot (e.g. you no longer have Geo enabled), you can remove it in the
+ PostgreSQL console session:
+
+ ```sql
+ SELECT pg_drop_replication_slot('name_of_extra_slot');
+ ```
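+
+To see how much space the write-ahead log is currently consuming, you can check
+the size of the `pg_xlog` directory (a sketch, assuming the default Omnibus
+data directory):
+
+```sh
+# Report the total size of the write-ahead log directory.
+sudo du -sh /var/opt/gitlab/postgresql/data/pg_xlog
+```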
+
+## Very large repositories never successfully synchronize on the **secondary** node
+
+GitLab places a timeout on all repository clones, including project imports
+and Geo synchronization operations. If a fresh `git clone` of a repository
+on the **primary** node takes more than a few minutes, you may be affected by this.
+To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
+on the **secondary** node:
+
+```ruby
+gitlab_rails['gitlab_shell_git_timeout'] = 10800
+```
+
+Then reconfigure GitLab:
+
+```sh
+sudo gitlab-ctl reconfigure
+```
+
+This will increase the timeout to three hours (10800 seconds). Choose a time
+long enough to accommodate a full clone of your largest repositories.
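+
+To gauge whether you are affected, you can time a fresh clone of your largest
+repository on the **primary** node (the URL below is a placeholder):
+
+```sh
+# Time a fresh clone of a large repository to estimate the timeout you need.
+time git clone https://gitlab.example.com/group/large-project.git /tmp/clone-test
+```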
+
+## How to reset Geo **secondary** node replication
+
+If a **secondary** node ends up in a broken state and you want to reset its replication state
+and start again from scratch, the following steps can help:
+
+1. Stop Sidekiq and the Geo LogCursor
+
+   It's possible to make Sidekiq stop gracefully: first make it stop accepting new jobs,
+   then wait until the jobs currently being processed have finished.
+
+   To do this, send a **SIGTSTP** kill signal for the first phase, and then a **SIGTERM**
+   when all jobs have finished. Otherwise, just use the `gitlab-ctl stop` commands.
+
+ ```sh
+ gitlab-ctl status sidekiq
+ # run: sidekiq: (pid 10180) <- this is the PID you will use
+ kill -TSTP 10180 # change to the correct PID
+
+ gitlab-ctl stop sidekiq
+ gitlab-ctl stop geo-logcursor
+ ```
+
+   You can watch the Sidekiq logs to know when the jobs have finished processing:
+
+ ```sh
+ gitlab-ctl tail sidekiq
+ ```
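+
+   If you prefer a programmatic check over tailing logs, the standard Sidekiq API
+   exposes a busy-job count (a sketch using `Sidekiq::ProcessSet`):
+
+   ```sh
+   # Prints the number of jobs currently being processed across all Sidekiq processes.
+   sudo gitlab-rails runner "puts Sidekiq::ProcessSet.new.map { |process| process['busy'] }.sum"
+   ```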
+
+1. Rename repository storage folders and create new ones
+
+ ```sh
+ mv /var/opt/gitlab/git-data/repositories /var/opt/gitlab/git-data/repositories.old
+ mkdir -p /var/opt/gitlab/git-data/repositories
+ chown git:git /var/opt/gitlab/git-data/repositories
+ ```
+
+   TIP: **Tip:**
+   To save disk space, you may want to remove `/var/opt/gitlab/git-data/repositories.old`
+   as soon as you have confirmed that you no longer need it.
+
+1. _(Optional)_ Rename other data folders and create new ones
+
+   CAUTION: **Caution:**
+   You may still have files on the **secondary** node that have been removed from the **primary**
+   node, but whose removal has not yet been reflected. If you skip this step, they will never
+   be removed from this Geo node.
+
+   Any uploaded content, like file attachments, avatars, or LFS objects, is stored in a
+   subfolder in one of the two paths below:
+
+   1. `/var/opt/gitlab/gitlab-rails/shared`
+   1. `/var/opt/gitlab/gitlab-rails/uploads`
+
+ To rename all of them:
+
+ ```sh
+ gitlab-ctl stop
+
+ mv /var/opt/gitlab/gitlab-rails/shared /var/opt/gitlab/gitlab-rails/shared.old
+ mkdir -p /var/opt/gitlab/gitlab-rails/shared
+
+ mv /var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/uploads.old
+ mkdir -p /var/opt/gitlab/gitlab-rails/uploads
+ ```
+
+   Reconfigure to recreate the folders and make sure permissions and ownership
+   are correct:
+
+ ```sh
+ gitlab-ctl reconfigure
+ ```
+
+1. Reset the Tracking Database
+
+ ```sh
+ gitlab-rake geo:db:reset
+ ```
+
+1. Restart previously stopped services
+
+ ```sh
+ gitlab-ctl start
+ ```
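+
+   After restarting, you can confirm that all services are up again:
+
+   ```sh
+   gitlab-ctl status
+   ```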
+
+## How do I fix a "Foreign Data Wrapper (FDW) is not configured" error?
+
+When setting up Geo, you might see this warning in the `gitlab-rake
+gitlab:geo:check` output:
+
+```
+GitLab Geo tracking database Foreign Data Wrapper schema is up-to-date? ... foreign data wrapper is not configured
+```
+
+There are a few key points to remember:
+
+1. The FDW settings are configured on the Geo **tracking** database.
+1. The configured foreign server enables a login to the Geo
+**secondary**, read-only database.
+
+By default, the Geo secondary and tracking databases run on the
+same host but on different ports: 5432 and 5431, respectively.
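+
+As a quick sanity check, you can confirm that both databases are listening on
+the expected ports (a sketch, assuming the defaults above; if a port is
+missing, that database may be listening on a Unix socket only):
+
+```sh
+# List listening TCP sockets and filter for the two database ports.
+sudo ss -plnt | grep -E ':(5431|5432)'
+```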
+
+### Checking configuration
+
+NOTE: **Note:**
+The following steps are for Omnibus installs only. Using Geo with source-based installs [is deprecated](index.md#using-gitlab-installed-from-source-deprecated).
+
+To check the configuration:
+
+1. Enter the database console:
+
+ ```sh
+ gitlab-geo-psql
+ ```
+
+1. Check whether any tables are present. If everything is working, you
+should see something like this:
+
+ ```sql
+   gitlabhq_geo_production=# SELECT * FROM information_schema.foreign_tables;
+    foreign_table_catalog   | foreign_table_schema | foreign_table_name        | foreign_server_catalog  | foreign_server_name
+   -------------------------+----------------------+---------------------------+-------------------------+---------------------
+    gitlabhq_geo_production | gitlab_secondary     | abuse_reports             | gitlabhq_geo_production | gitlab_secondary
+    gitlabhq_geo_production | gitlab_secondary     | appearances               | gitlabhq_geo_production | gitlab_secondary
+    gitlabhq_geo_production | gitlab_secondary     | application_setting_terms | gitlabhq_geo_production | gitlab_secondary
+    gitlabhq_geo_production | gitlab_secondary     | application_settings      | gitlabhq_geo_production | gitlab_secondary
+   <snip>
+ ```
+
+   However, if the query returns `0 rows`, continue on to the next steps.
+
+1. Check that the foreign server mapping is correct via `\des+`. The
+ results should look something like this:
+
+ ```sql
+ gitlabhq_geo_production=# \des+
+ List of foreign servers
+ -[ RECORD 1 ]--------+------------------------------------------------------------
+ Name | gitlab_secondary
+ Owner | gitlab-psql
+ Foreign-data wrapper | postgres_fdw
+ Access privileges | "gitlab-psql"=U/"gitlab-psql" +
+ | gitlab_geo=U/"gitlab-psql"
+ Type |
+ Version |
+ FDW Options | (host '0.0.0.0', port '5432', dbname 'gitlabhq_production')
+ Description |
+ ```
+
+ NOTE: **Note:** Pay particular attention to the host and port under
+ FDW options. That configuration should point to the Geo secondary
+ database.
+
+   If you need to experiment with changing the host or port, the
+   following queries demonstrate how:
+
+ ```sql
+ ALTER SERVER gitlab_secondary OPTIONS (SET host 'my-new-host');
+   ALTER SERVER gitlab_secondary OPTIONS (SET port '5432');
+ ```
+
+ If you change the host and/or port, you will also have to adjust the
+ following settings in `/etc/gitlab/gitlab.rb` and run `gitlab-ctl
+ reconfigure`:
+
+ - `gitlab_rails['db_host']`
+ - `gitlab_rails['db_port']`
+
+1. Check that the user mapping is configured properly via `\deu+`:
+
+ ```sql
+ gitlabhq_geo_production=# \deu+
+ List of user mappings
+ Server | User name | FDW Options
+ ------------------+------------+--------------------------------------------------------------------------------
+ gitlab_secondary | gitlab_geo | ("user" 'gitlab', password 'YOUR-PASSWORD-HERE')
+ (1 row)
+ ```
+
+ Make sure the password is correct. You can test that logins work by running `psql`:
+
+ ```sh
+ # Connect to the tracking database as the `gitlab_geo` user
+ sudo -u git /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/geo-postgresql -p 5431 -U gitlab_geo -W -d gitlabhq_geo_production
+ ```
+
+ If you need to correct the password, the following query shows how:
+
+ ```sql
+ ALTER USER MAPPING FOR gitlab_geo SERVER gitlab_secondary OPTIONS (SET password 'my-new-password');
+ ```
+
+ If you change the user or password, you will also have to adjust the
+ following settings in `/etc/gitlab/gitlab.rb` and run `gitlab-ctl
+ reconfigure`:
+
+ - `gitlab_rails['db_username']`
+ - `gitlab_rails['db_password']`
+
+ If you are using [PgBouncer in front of the secondary
+ database](database.md#pgbouncer-support-optional), be sure to update
+ the following settings:
+
+ - `geo_postgresql['fdw_external_user']`
+ - `geo_postgresql['fdw_external_password']`
+
+### Manual reload of FDW schema
+
+If you're still unable to get FDW working, you may want to try manually
+reloading the FDW schema:
+
+1. On the node running the Geo tracking database, enter the PostgreSQL console via
+ the `gitlab_geo` user:
+
+ ```sh
+ sudo -u git /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/geo-postgresql -p 5431 -U gitlab_geo -W -d gitlabhq_geo_production
+ ```
+
+ Be sure to adjust the port and hostname for your configuration. You
+ may be asked to enter a password.
+
+1. Reload the schema via:
+
+ ```sql
+ DROP SCHEMA IF EXISTS gitlab_secondary CASCADE;
+ CREATE SCHEMA gitlab_secondary;
+ GRANT USAGE ON FOREIGN SERVER gitlab_secondary TO gitlab_geo;
+ IMPORT FOREIGN SCHEMA public FROM SERVER gitlab_secondary INTO gitlab_secondary;
+ ```
+
+1. Test that queries work:
+
+ ```sql
+   SELECT * FROM information_schema.foreign_tables;
+   SELECT * FROM gitlab_secondary.projects LIMIT 1;
+ ```
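+
+   Finally, re-run the health check from the beginning of this section to
+   confirm that the warning has cleared:
+
+   ```sh
+   sudo gitlab-rake gitlab:geo:check
+   ```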
+
+[database-start-replication]: database.md#step-3-initiate-the-replication-process
+[database-pg-replication]: database.md#postgresql-replication